A couple of months ago I published paprica v0.11, a set of scripts for conducting a metabolic inference from a collection of 16S rRNA gene reads. This approach allows you to estimate the functional capabilities of a microbial community if you don’t have access to a metagenome or metatranscriptome. Paprica started as a method for a paper I was writing but eventually became complex enough to warrant its own publication. Paprica v0.11 reflected this origin: it produced nice results but was kludgy and cumbersome.
Over the last couple of weeks I’ve given paprica a complete overhaul and am happy to introduce v0.20. There are a number of major differences between v0.11 and v0.20, but the most significant is a clearer division between database construction, for those who want full control (and access to the PGDBs), and sample analysis, which can proceed with only the provided lightweight database (although you will not have access to the PGDBs). Executing paprica v0.20 is as easy as the following (from your home directory, for the provided file test.fasta):
git clone https://github.com/bowmanjeffs/genome_finder.git
cd genome_finder
chmod a+x paprica_run.sh
./paprica_run.sh test
One really important distinction between this version and v0.11 is that metabolic pathways are NOT predicted directly on internal nodes. I made this change for reasons of organization and efficiency, and I’m not sure that predicting pathways directly on internal nodes made much sense anyway. Instead, the pathways likely to be found for an internal node are inferred from their appearance in its terminal daughter nodes (that is, the completed genomes that belong to the clade defined by the internal node). If a given pathway is present in some specified fraction (0.90 by default) of the terminal daughters, it is assigned to the internal node. You can change this value by modifying the appropriate variable in pathway_profile.txt. Some (including myself) might like to have a PGDB for an internal node for purposes of visualization or modeling. In the near future I’ll release a utility to create a PGDB for an internal node on demand.
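To make the cutoff concrete, here is a minimal Python sketch of the idea. It is not the code paprica actually runs: the function name, the dictionary layout, the example genomes and pathways, and the choice of "greater than or equal to" at the cutoff are all assumptions made for illustration.

```python
from collections import Counter


def infer_internal_node_pathways(daughter_pathways, cutoff=0.90):
    """Return the pathways found in at least `cutoff` of the terminal daughters.

    daughter_pathways is a hypothetical structure: a dict mapping each
    completed genome in the clade to the set of pathways predicted for it.
    """
    n_daughters = len(daughter_pathways)
    if n_daughters == 0:
        return set()

    # Count how many terminal daughters carry each pathway.
    counts = Counter()
    for pathways in daughter_pathways.values():
        counts.update(set(pathways))

    # Keep a pathway if the fraction of daughters carrying it meets the cutoff.
    # Whether the boundary is inclusive is an assumption of this sketch.
    return {p for p, n in counts.items() if n / n_daughters >= cutoff}


# Made-up clade of four completed genomes.
clade = {
    "genome_A": {"glycolysis", "TCA cycle", "denitrification"},
    "genome_B": {"glycolysis", "TCA cycle"},
    "genome_C": {"glycolysis", "TCA cycle"},
    "genome_D": {"glycolysis"},
}

print(sorted(infer_internal_node_pathways(clade)))               # ['glycolysis']
print(sorted(infer_internal_node_pathways(clade, cutoff=0.75)))  # ['TCA cycle', 'glycolysis']
```

With the default of 0.90 only the pathway shared by all four genomes survives. Lowering the cutoff to 0.75 also admits the pathway found in three of the four, which shows how sensitive the inferred pathway set for an internal node can be to this one value.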
Some other major improvements…
- Fewer dependencies. For the scripts called in paprica_run.sh you need pplacer, seqmagick, infernal, and some Python modules that you should probably have anyway.
- Improved reference tree. I’m still working on this, but the current method uses RAxML for phylogenetic inference and Infernal for alignment, which seems to work much better than the previous (albeit much faster) combo of FastTree and mothur. Thanks to Erick Matsen for helpful suggestions in this regard.
- More genome parameters. I have a particular interest in how genome parameters (e.g. length and coding density) are distributed in the environment. Paprica gives you a whole list of interesting metrics for the terminal and internal nodes.
Paprica is still in heavy development and I have a lot of improvements planned for future versions. If you try v0.20 I’d love to know what you think – good, bad, or otherwise! You can create an issue on GitHub or email me.