This tutorial is both a work in progress and a living document. If you see an error, or want something added, please let me know by leaving a comment.
Building the paprica database provides maximum flexibility but involves more moving parts and resources than using the provided database. Basic instructions for using the paprica-build.sh script are provided in the manual; this tutorial is intended to provide a more detailed step-by-step guide.
While a laptop running Linux, Windows with VirtualBox, or OSX is perfectly adequate for analysis with paprica, you’ll need something a little beefier for building the database (unless you’re really patient). A high performance cluster is overkill; I build the provided database on a basic 12 core Linux workstation with 32 GB RAM (< $5k). Something in this ballpark should work fine. Of course more cores will get the job done faster, but keep an eye on memory usage.
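If you are not sure how a machine compares to that workstation, a quick check along these lines may help. This is only a sketch for Linux: it assumes the `nproc` utility and the `/proc/meminfo` file are available, and the variable names are mine.

```shell
# Report the core count and total RAM of this machine (Linux only).
cores=$(nproc)

# MemTotal in /proc/meminfo is given in kB; convert it to whole GB.
mem_gb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)

echo "cores: ${cores}, RAM: ${mem_gb} GB"
```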
Once you’ve got the hardware requirements sorted out you need to download the dependencies. I recommend first following all the instructions for the paprica-run.sh script, then installing RAxML, pathway-tools, and taxtastic. The rest of this tutorial assumes you’ve done just that, including running the test.bacteria.fasta file against the bacteria database:
./paprica-run.sh test.bacteria bacteria
Install Remaining Dependencies
In addition to all the dependencies required by paprica-run.sh, you need pathway-tools and RAxML. These are very mainstream programs, but that doesn’t necessarily mean installation is easy. In particular, pathway-tools requires that you request a license (free for academic users). This takes about 24 hours, after which you’ll receive a link to download the installer.
Regardless of whether you’re sitting at the workstation or accessing it via SSH, a GUI will pop up and guide you through the installation. In general you can accept the defaults; however, the GUI will ask you where pathway-tools should create the ptools-local directory. This is where the program will create the pathway-genome databases (PGDBs) that describe (among other things) the metabolic pathways in each genome. By the time you are done creating the database this directory will be > 100 GB, so pick a location with plenty of space! This might not be your home directory (the default location). For example, on my system my home directory is housed on a small SSD. To keep the home directory from becoming bloated I opted to locate ptools-local on a separate drive.
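Before you answer the installer’s question about ptools-local, it may be worth confirming that the location you have in mind really has the space. The sketch below is not part of paprica or pathway-tools; `has_space` and the `PTOOLS_PARENT` variable are names I made up for illustration.

```shell
# has_space DIR MIN_GB: succeed if the filesystem holding DIR has at
# least MIN_GB gigabytes available.
has_space() {
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte
    # blocks; field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 {print $4}')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# PTOOLS_PARENT is a hypothetical variable: set it to the directory
# where you are thinking of putting ptools-local.
candidate=${PTOOLS_PARENT:-${HOME:-/}}
if has_space "$candidate" 100; then
    echo "$candidate has at least 100 GB free"
else
    echo "$candidate has less than 100 GB free, consider another drive"
fi
```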
You will receive a number of download options from the pathway-tools development team. I recommend that you conduct only the basic installation of pathway-tools (this is the EcoCyc and MetaCyc option), and do not download and install additional PGDBs. There is nothing wrong with installing these additional, well-curated PGDBs, but they cost extra space and time and make the installation ponderous. You can always add them later if you want to become a metabolic modeling rock star.
Once you’ve installed pathway-tools, be sure to add the program to your PATH, following standard methods. Then re-source .profile and type pathway-tools in a bash shell. The GUI should open.
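For anyone unsure what “standard methods” means here, this is roughly what it looks like. The install location `$HOME/pathway-tools` is an assumption on my part; substitute wherever the installer actually put the program.

```shell
# Assumed install location; change this to match your system.
PTOOLS_HOME="$HOME/pathway-tools"

# To make the change permanent, add the same export line to ~/.profile:
#   export PATH="$PATH:$HOME/pathway-tools"
export PATH="$PATH:$PTOOLS_HOME"

# command -v prints the resolved path if the shell can find the program.
if command -v pathway-tools >/dev/null 2>&1; then
    echo "pathway-tools found at $(command -v pathway-tools)"
else
    echo "pathway-tools is not on PATH yet"
fi
```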
RAxML is the one piece of software used by paprica that requires compilation. Fortunately the RAxML folks generally build good software, so compiling isn’t likely to be a problem. RAxML comes in several flavors and paprica is a bit particular about the version of RAxML it expects to find. RAxML gets called in two scripts: paprica-get_ref.py and paprica-place_it.py. These scripts call “raxmlHPC-PTHREADS-AVX2”, so you need to make sure you build the parallel threaded AVX2 version, i.e.:
make -f Makefile.AVX2.PTHREADS.gcc
If hardware limitations (a very old workstation or some such) prevent you from building that version, you’ll need to either modify those scripts accordingly or name your working RAxML executable raxmlHPC-PTHREADS-AVX2 (I cringe at this suggestion, but it is the simplest solution). Get in contact with me if you have issues. As with all dependencies, after you build raxmlHPC-PTHREADS-AVX2 you need to add the RAxML directory to your PATH, re-source .profile, and test the installation by typing raxmlHPC-PTHREADS-AVX2 in a bash shell. If RAxML yells at you with a warning, it installed correctly and you’re good to go.
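If you don’t know whether your processor supports AVX2, something like the following should tell you on Linux. It is a sketch only: it reads the CPU flags from `/proc/cpuinfo` and falls back to suggesting the SSE3 Pthreads makefile that ships with RAxML. Which fallback build suits your hardware is your call, and paprica will still look for the AVX2 name as described above.

```shell
# Linux lists CPU feature flags in /proc/cpuinfo; -w matches avx2 as a
# whole word, -q suppresses output and only sets the exit status.
if grep -qw avx2 /proc/cpuinfo 2>/dev/null; then
    raxml_make="Makefile.AVX2.PTHREADS.gcc"
else
    # No AVX2 flag found: suggest the SSE3 Pthreads build instead.
    raxml_make="Makefile.SSE3.PTHREADS.gcc"
fi
echo "suggested build: make -f ${raxml_make}"

# Check whether the name that paprica calls is already on PATH.
if command -v raxmlHPC-PTHREADS-AVX2 >/dev/null 2>&1; then
    echo "raxmlHPC-PTHREADS-AVX2 is on PATH"
else
    echo "raxmlHPC-PTHREADS-AVX2 is not on PATH yet"
fi
```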
There are a lot of moving parts in paprica-build.sh. Because of this, and because of the amount of time certain steps take to complete, troubleshooting can be a little frustrating. To make things easier you can download a fake ref_genome_database directory with 11 genomes pre-loaded (10 in bacteria/refseq and 1 in user/bacteria). Remove the old ref_genome_database directory that came with paprica, then download and untar the new one:
rm -r ref_genome_database
wget http://www.polarmicrobes.org/extras/ref_genome_database.tgz
tar -xzvf ref_genome_database.tgz
At this point it is absolutely essential that you open paprica-build.sh and switch the -download flag in the paprica-make_ref.py line from “T” to “test”. If you don’t do this you will initiate a fresh download of all the completed genomes in GenBank. Make sure you switch this flag back to “T” when you are done testing and ready to actually build the database.
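Because forgetting this flag is costly, it can help to print the relevant line before each run and confirm what it is set to. The helper below is my own sketch, not something that ships with paprica; it only shows the lines that mention -download and leaves the editing to you.

```shell
# show_download_flag FILE: print every line of FILE that mentions
# -download, prefixed with its line number.
show_download_flag() {
    # -e marks the next argument as the pattern, so the leading dash
    # is not mistaken for a grep option.
    grep -n -e '-download' "$1"
}

# Only run the check if the script is in the current directory.
if [ -f paprica-build.sh ]; then
    show_download_flag paprica-build.sh
fi
```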
Once you’ve downloaded the test ref_genome_database directory and switched the flag simply execute the script:
./paprica-build.sh bacteria
You’ll see lots of output flash by on the screen. Keep an eye out for error messages that could indicate something amiss. If something goes wrong the error messages should be quite obvious as the later steps will fail completely. I have noticed that for reasons which are not yet clear, when building the full database the script sometimes hangs after PGDB creation. If that happens use control-c to exit the bash script, and re-execute the paprica-build_core_genomes.py script. You can re-execute the whole bash script if you like, but it takes some time and isn’t necessary. If you start seeing messages like:
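Since the output scrolls past quickly, one option is to save it to a file and search it afterwards. The sketch below assumes you piped the build output through `tee build.log` (a file name I chose), and that problems show up as lines containing “error” or a Python “Traceback”. That second assumption is a guess about how failures are worded, so treat a clean result as a hint rather than proof.

```shell
# scan_log FILE: print lines of FILE that look like errors, with line
# numbers. Exits non-zero if nothing matches.
scan_log() {
    # -i: ignore case, -n: show line numbers, -E: extended regex.
    grep -inE 'error|traceback' "$1"
}

# To create the log, append this to whatever build command you run:
#   2>&1 | tee build.log
if [ -f build.log ]; then
    scan_log build.log || echo "no obvious error lines in build.log"
fi
```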
collecting data for internal node 13, 1 of 9
collecting data for internal node 19, 2 of 9
collecting data for internal node 16, 3 of 9
collecting data for internal node 6, 4 of 9
collecting data for internal node 2, 5 of 9
collecting data for internal node 5, 6 of 9
collecting data for internal node 15, 7 of 9
collecting data for internal node 9, 8 of 9
collecting data for internal node 14, 9 of 9
…you can relax, you’re at the end and nothing went amiss. This small collection of genomes includes the CCG (closest completed genome) for the reads in test.bacteria.fasta, so you can even test your new “database” with paprica-run.sh:
./paprica-run.sh test.bacteria bacteria
Add Custom Draft Genomes (Optional)
V0.3.0 of paprica introduced the ability to add custom draft genomes. These genomes add additional CCGs to your reference tree and increase the accuracy of the metabolic inference. Because they are not necessarily complete, however, they are not used to calculate genome parameters such as the number of 16S rRNA genes or the length of the genome. Placements to these edges should produce NA values in the edge_data.csv file produced by paprica-run.sh.
If you don’t want to add custom draft genomes you don’t need to do anything at this point. If you do, create a unique directory for each one, named by accession number, in ref_genome_database/user/bacteria or ref_genome_database/user/archaea (as appropriate). For example, to add several draft Sulfitobacter genomes my directory structure looks like this:
me@computer:/home/me/paprica/ref_genome_database/user/bacteria$ ls
draft.combined_16S.fasta GCF_000152645.1 GCF_000622325.1 GCF_000622365.1 GCF_000622405.1 GCF_000647675.1 GCF_000735125.1
GCF_000152605.1 GCF_000620505.1 GCF_000622345.1 GCF_000622385.1 GCF_000622425.1 GCF_000712315.1
Inside each of these directories are the .fna file for the draft genome and the genomic .gbff file (use that extension, not .gbk). So like this:
me@computer:/home/me/paprica/ref_genome_database/user/bacteria/GCF_000152645.1$ ls
GCF_000152645.1_ASM15264v1_genomic.fna GCF_000152645.1_ASM15264v1_genomic.gbff
When you build the database paprica-make_ref.py will automatically look in the user directory and attempt to use anything it finds there. Your draft genome must have a 16S rRNA gene, which not all drafts do. If it does not, the build will continue without your draft genome.
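With more than a handful of drafts it is easy to miss a file. A small check like the one below can confirm that each genome directory holds both file types before you start a build. `check_drafts` is a hypothetical helper of my own; it only looks at file extensions and does not check for a 16S rRNA gene.

```shell
# check_drafts DIR: for each subdirectory of DIR, report whether it
# contains at least one .fna file and one .gbff file.
check_drafts() {
    for d in "$1"/*/; do
        # Skip the unexpanded pattern when DIR has no subdirectories.
        [ -d "$d" ] || continue
        name=$(basename "$d")
        # $d ends in a slash, so "$d"*.fna expands to DIR/name/*.fna.
        if ls "$d"*.fna >/dev/null 2>&1 && ls "$d"*.gbff >/dev/null 2>&1; then
            echo "$name ok"
        else
            echo "$name missing .fna or .gbff"
        fi
    done
}

# Run from the paprica directory; prints nothing if the path is absent.
check_drafts ref_genome_database/user/bacteria
```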
Add Custom EC Numbers
If EC numbers that you’re pretty sure should be there are not reported for a genome (annotations aren’t perfect), you can let paprica know about this by modifying the user_ec.csv file in ref_genome_database/user. Open the file in a text editor and you should see this:
## This file can be used to indicate enzymes by EC number that you
## believe should be in a genome, but do not appear in the genome.
## These EC numbers will propagate to internal nodes, but will not
## be used in pathway prediction. The first row of this file is a
## header and should not be modified. Rows should be numbered
## sequentially.
##
## Example rows:
## 1,GCF12345,18.104.22.168,happy enzyme
## 2,GCF54321,22.214.171.124,sad enzyme
##
,GI_number,EC_number,product
Follow the instructions to add new enzymes to your database. These enzymes will propagate to internal nodes every time you update your database, but they will not be used in pathway prediction.
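If you would rather not number rows by hand, a helper like this can append a row with the next sequential number. It is a sketch under assumptions: `add_ec` is my own name, the row layout copies the example rows in the file (row number, accession, EC number, product), the file must already end with a newline, and the product name must not contain a comma. The accession in the usage comment is a placeholder.

```shell
# add_ec FILE ACCESSION EC PRODUCT: append a row to FILE, numbered one
# past the highest existing row number.
add_ec() {
    # Data rows start with an integer followed by a comma; the ##
    # comment lines and the header row do not, so they are ignored.
    last=$(awk -F, '/^[0-9]+,/ { if ($1 + 0 > n) n = $1 + 0 } END { print n + 0 }' "$1")
    printf '%s,%s,%s,%s\n' "$(( last + 1 ))" "$2" "$3" "$4" >> "$1"
}

# Example usage (placeholder accession):
#   add_ec ref_genome_database/user/user_ec.csv GCF_000000000.1 1.1.1.1 "alcohol dehydrogenase"
```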
Build the Database
If you tested paprica-build.sh as described above (you should!) make sure you switch the -download flag back to “T”. This will tell paprica to initiate a fresh download of genomes from GenBank. Don’t worry about the ref_genome_database directory you used for testing; it is fully compatible with the new download. Once you’ve done this, initiate a build of the real database:
./paprica-build.sh bacteria
Of course if you wanted to build the archaeal database you would replace “bacteria” with “archaea”.
One of the particularly time consuming steps in paprica-build.sh is pathway prediction by pathway-tools. In addition to taking some time (it has a lot of work to do) pathway-tools needs to send you graphical messages. You can ignore these, but if you close the SSH session progress will stop because pathway-tools has no place to send the messages.