This tutorial is both a work in progress and a living document. If you see an error, or want something added, please let me know by leaving a comment.
Starting with version 3.0.0 paprica contains a metagenomic annotation module. This module takes as input a fasta or fasta.gz file containing the QC’d reads from a shotgun metagenome and uses DIAMOND Blastx to classify these reads against a database derived from the paprica database. The module produces as output:
- Classification for each read in the form of an EC number (obviously this applies only to reads associated with genes coding for enzymes).
- A tally of the occurrence of each EC number in the sample, with some useful supporting information.
- Optionally, the metabolic pathways likely to be present within the sample.
In addition to the normal paprica-run.sh dependencies paprica-mg requires DIAMOND Blastx. Follow the instructions in the DIAMOND manual, and be sure to add the location of the DIAMOND executables to your PATH. If you want to predict metabolic pathways on your metagenome you will need to also download the pathway-tools software. See the notes here.
Obtain the paprica-mg database
There are two ways to obtain the paprica-mg database. You can obtain a pre-made version of the database by downloading the files paprica-mg.dmnd and paprica-mg.ec.csv.gz (large!) to the ref_genome_database directory in the paprica directory. Be sure to gunzip paprica-mg.ec.csv.gz before continuing.
## Navigate to wherever you installed the paprica database cd ~/paprica/ref_genome_database ## Download the two files that constitute the paprica-mg database wget http://www.polarmicrobes.org/extras/paprica-mg.dmnd wget http://www.polarmicrobes.org/extras/paprica-mg.ec.csv.gz ## Gunzip paprica-mg.ec.csv.gz gunzip paprica-mg.ec.csv.gz
Alternatively, if you wish to build the paprica-mg database from scratch, perhaps because you’ve customized that database or are building it more frequently than the release cycle, you will need to first build the regular paprica database. Then build the paprica-mg database as such:
paprica-mg_build.py -ref_dir ref_genome_database
If you’ve set paprica up in the standard way you can be execute this command from anywhere on your system; the paprica directory is already in your PATH, and the script will look for the directory “ref_genome_database” relative to itself. Likewise you don’t need to be in the paprica directory to execute the paprica-mg_run.py script.
Annotate a metagenome
Once you’ve downloaded or built the database you can run your analysis. It is worth spending a little time with the DIAMOND manual and considering the parameters of your system. To try things out you can download a “smallish” QC’d metagenome from the Tara Oceans Expedition (selected randomly for convenient size):
## Download a test metagenome wget http://www.polarmicrobes.org/extras/ERR318619_1.qc.fasta.gz ## Execute paprica-mg for EC annotation only paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways F
This will produce the following output:
test.annotation.csv: The number of hits in the metagenome, by EC number. See the paprica manual for a complete explanation of columns.
test.paprica-mg.nr.daa: The DIAMOND format results file. Only one hit per read is reported.
test.paprica-mg.nr.txt: A text file of the DIAMOND results. Only one hit per read is reported.
Predicting pathways on a metagenome is very time intensive and it isn’t clear what the “correct” way is to do this. I’ve tried to balance speed with accuracy in paprica-mg. If you execute with -pathways T, DIAMOND is executed twice; once for the EC number annotation as above (reporting only a single hit for each read), and once to collect data for pathway prediction. On that search DIAMOND reports as many hits for each read as there are genomes in the paprica database. Of course most reads will have far fewer (if any) hits. The reason for this approach is to try and reconstruct as best as possible the actual genomes that are present. For example, let’s say that a given read has a significant hit to an enzyme in genome A and genome B. When aggregating information for pathway-tools the enzyme in genome A and genome B will be presented to pathway-tools in separate Genbank files representing separate genetic elements. Because a missing enzyme in either genome A or genome B could cause a negative prediction for the pathway, we want the best chance of capturing the whole pathway. So a second enzyme, critical to the prediction of that pathway, might get predicted for only genome A or genome B. The idea is that the incomplete pathways will get washed out at the end of the analysis, and since pathway prediction is by its nature non-redundant (each pathway can only be predicted once) over-prediction is minimized. To predict pathways during annotation:
## execute paprica-mg for EC annotation and pathway prediction paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways T -pgdb_dir /location/of/ptools-local
In addition to the files already mentioned, you will see:
test_mg.pathologic: a directory containing all the information that pathway-tools needs for pathway prediction.
test.pathways.txt: A simple text file of all the pathways that were predicted.
test.paprica-mg.txt: A very large text file of all hits for each read. You probably want to delete this right away to save space.
test.paprica-mg.daa: A very large DIAMOND results file of all hits for each read. You probably want to delete this right away to save space.
testcyc: A directory in ptools-local/pgdbs/local containing the PGDB and prediction reports. It is worth spending some time here, and interacting with the PGDB using the pathway-tools GUI.