Tutorial: Installing paprica on Mac OSX

The following is a paprica installation tutorial for novice users on Mac OSX (installation on Linux is quite a bit simpler). If you’re comfortable editing your PATH and installing things using bash you probably don’t need to follow this tutorial; just get the dependencies as indicated in the manual. If command line operations stress you out, and you haven’t dealt with a lot of weird bioinformatics program installs, use this tutorial.

Please note that this tutorial is a work in progress.  If you notice errors, inconsistencies, or omissions please leave a comment and I’ll be sure to correct them.

** IMPORTANT ** It is generally considered very poor practice to install anything in the root directory.  You might think, “but I’m the only user, so this makes more sense” or “but everyone in the lab wants program X, so I should install as root.”  Don’t do it.  Install to your home directory.  It will add years to your life.

This tutorial assumes you’ve followed this advice, and that you are installing all the dependencies in your home directory.

Install Python and Python packages

paprica is 90 % an elaborate wrapper script (or set of scripts) for several core programs written by other groups. The scripts that execute the pipeline are bash scripts; the scripts that do the actual work are Python. You therefore need Python up and running on your system. If you already have a mainstream Python 2.7 distribution installed, just make sure that the biopython, joblib, and pandas modules are installed and you’re good to go.

Install biopython (version 1.66), joblib, and pandas:

pip install biopython==1.66
pip install joblib
pip install pandas

In case you have conflicts with other Python installations, or some other mysterious problems, it’s a good idea to test things out at this point. Open a shell, type “python”, and then enter the following at the Python prompt:

import Bio
import joblib
import pandas

If you get any error messages something somewhere is wrong. Burn some incense and try again. If that doesn’t work try holy water.
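If you’d rather not open the interpreter, the same test can be scripted. This is only a sketch: missing_modules is a helper name I made up for illustration (it is not part of paprica), and it assumes the interpreter you hand it is the same one paprica will end up using.

```shell
# Print the name of every Python module that fails to import.
# Usage: missing_modules INTERPRETER MODULE [MODULE ...]
# (missing_modules is a hypothetical helper, not part of paprica)
missing_modules() (
    interpreter=$1
    shift
    for module in "$@"; do
        # python -c runs a single statement; a failed import exits non-zero
        if ! "$interpreter" -c "import $module" >/dev/null 2>&1; then
            echo "$module"
        fi
    done
)
```

Typing missing_modules python Bio joblib pandas should print nothing if all three modules import cleanly; any name it does print is the one to reinstall.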

Install Homebrew and GSL

One challenge with paprica on OSX has to do with the excellent program pplacer. The pplacer binary for Darwin needs the GNU Scientific Library (GSL), specifically v1.16 (at the time of writing – note that later versions will not work with pplacer!). You can try to compile this from source, but I’ve had trouble getting this to work on OSX. The easier option is to use a package manager, preferably Homebrew. This means, however, that you have to marry one of the OSX package managers and never look back. Fink, MacPorts, and Homebrew will all get you a working version of GSL. I recommend using Homebrew.

To download Homebrew (assuming you don’t already have it) type:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Follow the on-screen instructions. Once Homebrew is installed, type:

brew tap homebrew/versions
brew install gsl@1

## On the test platform Homebrew didn't want to install to the default
## location. pplacer requires the GSL library files to be in that
## location (unless you want to install from source, which you don't),
## so copy them over. Note that /usr is a system directory that holds
## libraries and executables; it is not your user name.

cp /usr/local/opt/gsl@1/lib/*dylib /usr/local/lib/

After the copy, the GNU Scientific Library v1.16 should be in the default location, /usr/local/lib.
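It’s worth confirming that the library files really did land where pplacer will look for them. Here is a minimal sketch of one way to check; count_gsl_libs is a hypothetical helper of mine, and it assumes the GSL shared libraries all have names starting with libgsl and ending in .dylib.

```shell
# Count the GSL shared libraries (libgsl*.dylib) in a directory.
# Usage: count_gsl_libs DIRECTORY
# (count_gsl_libs is a hypothetical helper, not part of Homebrew or paprica)
count_gsl_libs() (
    count=0
    for lib in "$1"/libgsl*.dylib; do
        # an unmatched glob stays as literal text, so test that the file exists
        if [ -e "$lib" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
)
```

If count_gsl_libs /usr/local/lib prints 0, the copy didn’t work and pplacer is unlikely to start.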

Another very useful tool is wget, which you will use to download installation files (among other things) throughout this tutorial. Install it with Homebrew:

brew install wget

Install Infernal and pplacer

Assuming all that went okay, go ahead and download the software you need to execute just the paprica-run.sh portion of paprica. First, the excellent aligner Infernal. From your home directory:

wget http://eddylab.org/infernal/infernal-1.1.1-macosx-intel.tar.gz
tar -xzvf infernal-1.1.1-macosx-intel.tar.gz
mv infernal-1.1.1-macosx-intel infernal

Then pplacer, which also includes guppy:

wget https://github.com/matsen/pplacer/releases/download/v1.1.alpha17/pplacer-Darwin-v1.1.alpha17.zip
unzip pplacer-Darwin-v1.1.alpha17.zip
mv pplacer-Darwin-v1.1.alpha17-6-g5cecf99 pplacer
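Before touching your PATH it doesn’t hurt to confirm that the unpacked directories contain what you expect. The following is a sketch under a couple of assumptions: missing_in_dir is a made-up helper name, the Infernal programs sit in the binaries subdirectory, and pplacer and guppy sit at the top level of the pplacer directory (these are the locations added to PATH later in this tutorial).

```shell
# Print each expected program that is missing or not executable in a directory.
# Usage: missing_in_dir DIRECTORY PROGRAM [PROGRAM ...]
# (missing_in_dir is a hypothetical helper, not part of Infernal or pplacer)
missing_in_dir() (
    dir=$1
    shift
    for program in "$@"; do
        if [ ! -x "$dir/$program" ]; then
            echo "$dir/$program"
        fi
    done
)
```

Both missing_in_dir ~/infernal/binaries cmalign esl-alimerge and missing_in_dir ~/pplacer pplacer guppy should print nothing. Anything printed is a file that didn’t unpack where this tutorial expects it.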

Add dependencies to PATH

Now comes the tricky bit: you need to add the locations of the executables for these programs to your PATH variable.  This is a pretty important basic computing skill to master.  Try not to screw it up.  It isn’t hard to undo screw-ups, but it will freak you out because bash will suddenly be unable to find programs that it could find before. Before you continue please read the excellent summary of shell startup scripts as they pertain to OSX here:

http://hayne.net/MacDev/Notes/unixFAQ.html#shellStartup

This tutorial assumes that you are using paprica from a login shell.  A login shell means that 1) you’ve logged into the OSX GUI and opened a terminal or 2) you’ve logged in remotely via SSH.  If you “screen” a shell session you’ve opened a non-login shell and will need to do things slightly differently.  If you’re using screen you probably already know how to modify your PATH appropriately; if not, consult the link above (hint: I find that the easiest way to avoid a potential startup file quagmire is to source .bashrc from .bash_profile and otherwise only modify .bashrc).
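For the curious, sourcing .bashrc from .bash_profile might look something like the sketch below. The helper name source_if_readable is my own invention; the guard is there so that a missing .bashrc doesn’t produce an error every time you log in.

```shell
# Source a startup file only if it exists and is readable.
# Usage: source_if_readable FILE
# (source_if_readable is a hypothetical helper name)
source_if_readable() {
    if [ -r "$1" ]; then
        . "$1"
    fi
}

# With the function defined in ~/.bash_profile, the line to add there is:
#   source_if_readable "$HOME/.bashrc"
```

With this in place, login shells pick up everything in .bashrc, so .bashrc becomes the only file you need to edit.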

If you installed Python with Anaconda, the installer most likely created either a .bash_profile or .profile file and added itself to your PATH in that file.  Different users have reported different behavior with the Anaconda installer, and alternate distros might do something different.  If .bash_profile exists it trumps .profile.  Check to see which you have.  In your home directory type:

ls -a
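If the ls output is long, a few lines of shell can tell you which file wins. A bash login shell looks for .bash_profile, then .bash_login, then .profile, and reads only the first one it finds. The sketch below mimics that lookup; login_startup_file is a hypothetical name of mine and nothing here changes any files.

```shell
# Print the startup file a bash login shell would read from a home directory.
# bash uses the first of these that exists and is readable.
# Usage: login_startup_file [HOME_DIRECTORY]
# (login_startup_file is a hypothetical helper name)
login_startup_file() (
    home=${1:-$HOME}
    for name in .bash_profile .bash_login .profile; do
        if [ -r "$home/$name" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "none"
)
```

Typing login_startup_file with no arguments checks your own home directory and prints the file you should be editing in the steps that follow.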

If you see both .profile and .bash_profile I recommend moving all the information in .profile to .bash_profile (using nano as described below, or a GUI text editor).  Alternatively add the following line to the bottom of .bash_profile.

source ~/.profile

This will force the shell to do whatever is indicated in .profile after it’s done what’s in .bash_profile.  Hopefully, however, you have only .profile or .bash_profile.  Open whichever you have (or .bash_profile if you have both) using nano:

nano .profile

## Or, if you have .bash_profile:

nano .bash_profile

Navigate to the end of the file and paste the following, modifying as necessary (note: there are lots of syntactic variations for adding a location to PATH, the below commands are a little redundant but clear and easy to modify):

export PATH=/Users/your-user-name/infernal/binaries:${PATH}
export PATH=/Users/your-user-name/pplacer:${PATH}
export PATH=/Users/your-user-name/paprica:${PATH}

Don’t be the guy or gal who types your-user-name. Replace with your actual user name. Hit ctrl-o to write out the file, enter to save, and ctrl-x to exit nano. Re-source .profile by typing:

source .profile

#Or if you have .bash_profile

source .bash_profile
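One side effect of plain export lines is that every time you re-source the file the same directories get added to PATH again. That’s harmless but messy. As one of the syntactic variations mentioned above, here is a sketch that skips directories already on PATH and uses $HOME so there is no user name to mistype. path_prepend is a hypothetical helper name, not something bash provides.

```shell
# Prepend a directory to PATH unless it is already listed.
# Usage: path_prepend DIRECTORY
# (path_prepend is a hypothetical helper name)
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, do nothing
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# With the function defined in the startup file, the three export lines become:
#   path_prepend "$HOME/infernal/binaries"
#   path_prepend "$HOME/pplacer"
#   path_prepend "$HOME/paprica"
```

The colons wrapped around $PATH in the case statement make sure only whole entries match, so /Users/you/pplacer is not mistaken for /Users/you/pplacer-old.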

Confirm that you can execute the following programs by navigating to your home directory and executing each of the following commands:

cmalign
esl-alimerge
pplacer
guppy

You should get an error message that is clearly from the program, not a bash error like “command not found”.
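If you’d like a quieter check that doesn’t actually launch each program, you can ask the shell whether it can find them. A minimal sketch, with missing_commands being a name I made up:

```shell
# Print each program name that the shell cannot find on PATH.
# Usage: missing_commands PROGRAM [PROGRAM ...]
# (missing_commands is a hypothetical helper name)
missing_commands() (
    for program in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        if ! command -v "$program" >/dev/null 2>&1; then
            echo "$program"
        fi
    done
)
```

missing_commands cmalign esl-alimerge pplacer guppy prints nothing when your PATH is set up correctly. Any name it prints points at a PATH line to recheck.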

Install Seqmagick

Now you need to install the final dependency, Seqmagick. Confirm the most current stable release by going to GitHub, then download it:

wget https://github.com/fhcrc/seqmagick/archive/0.6.1.tar.gz
tar -xzvf 0.6.1.tar.gz
pip install ./seqmagick-0.6.1

Check the installation by typing:

seqmagick

You should get a sensible error that is clearly seqmagick yelling at you.

Get paprica

Okay, now you are ready to get paprica and do some analysis! You can clone the latest version of the repository with:

git clone https://github.com/bowmanjeffs/paprica.git

Now make paprica-run.sh and the Python scripts executable:

cd paprica 
chmod a+x paprica-run.sh 
chmod a+x *py

At this point you should be ready to rock. Take a deep breath and type:

./paprica-run.sh test.bacteria bacteria

This analyzes the file test.bacteria.fasta against the bacteria database.  You should see a lot of output flash by on the screen, and you should see a number of new files in the directory with the prefix test.bacteria.  Check out the paprica analysis tutorial and manual for more info on these files.

To run your own analysis, say on amazing_sample.fasta against the bacteria database, simply type:

./paprica-run.sh amazing_sample bacteria
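Note that the first argument is the file name without the .fasta extension. If you have a directory full of samples, something like the sketch below would run them one after another. Both sample_name and run_all are hypothetical helpers I wrote for this tutorial, not part of paprica, and the loop assumes your fasta files sit in the paprica directory and end in .fasta.

```shell
# Strip the .fasta extension to get the name that paprica-run.sh expects.
# Usage: sample_name FILE
# (sample_name and run_all are hypothetical helpers, not part of paprica)
sample_name() {
    basename "$1" .fasta
}

# Run every .fasta file in the current directory against one database.
# Usage: run_all DATABASE   (call it from inside the paprica directory)
run_all() {
    for fasta in *.fasta; do
        [ -e "$fasta" ] || continue    # no .fasta files matched the glob
        ./paprica-run.sh "$(sample_name "$fasta")" "$1"
    done
}
```

Typing run_all bacteria from the paprica directory would then analyze each file in turn, including test.bacteria.fasta if it is still there.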

Please, please, please, read the manual (included in the paprica download) for further details, such as how to greatly decrease the run time on large fasta files, and how to sub-sample your input fasta. Remember that the fasta file you input should contain only reads you are reasonably sure come from the domain you’ve specified, and they should be properly QC’d (i.e. low quality ends and adapters and barcodes and such trimmed away).
