paprica – The Bowman Lab

Tutorial: Basic heatmaps and ordination with paprica output

Jeff — Mon, 11 Mar 2019 11:15:46 +0000

The output from our paprica pipeline for microbial community structure analysis and metabolic inference has changed quite a lot over the last few months. In response to some recent requests, here’s a tutorial that walks through an ordination and a few basic plots with the paprica output. The tutorial assumes that you’ve already completed the tutorial that runs paprica on the samples associated with our recent seagrass paper.

For our analysis let’s bring the pertinent files into R and do some pre-processing:

## read in the edge and unique abundance tables. Note that it's going to take a bit to load the unique_tally file because you have a lot of variables!

tally <- read.csv('2017.07.03_seagrass.bacteria.edge_tally.csv', header = T, row.names = 1)
unique <- read.csv('2017.07.03_seagrass.bacteria.unique_tally.csv', header = T, row.names = 1)

## read in edge_data and taxon_map and seq_edge_map

data <- read.csv('2017.07.03_seagrass.bacteria.edge_data.csv', header = T, row.names = 1)
taxa <- read.csv('2017.07.03_seagrass.bacteria.taxon_map.csv', header = T, row.names = 1, sep = ',', as.is = T)
map <- read.csv('2017.07.03_seagrass.bacteria.seq_edge_map.csv', header = T, row.names = 1)

## convert all na's to 0, then check for low abundance samples

tally[is.na(tally)] <- 0
unique[is.na(unique)] <- 0
rowSums(tally)

## remove any low abundance samples (i.e. bad library builds), and also
## low abundance reads.  This latter step is optional, but I find it useful
## unless you have a particular interest in the rare biosphere.  Note that
## even with subsampling your least abundant reads are noise, so at a minimum
## exclude everything that appears only once.

tally.select <- tally[rowSums(tally) > 5000,]
tally.select <- tally.select[,colSums(tally.select) > 1000]

unique.select <- unique[rowSums(unique) > 5000,]
unique.select <- unique.select[,colSums(unique.select) > 1000]
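If you'd like to see the filtering logic spelled out away from R's indexing shorthand, here is a minimal Python sketch of the same two steps. The sample names, edge names, and counts are invented, and `filter_tally` is a hypothetical helper rather than anything shipped with paprica. The cutoffs simply mirror the 5000 and 1000 used above.

```python
# Toy stand-in for an edge tally: rows are samples, columns are edges.
# The names and counts are invented for illustration only.
tally = {
    "sample_A": {"X1": 4000, "X2": 3000, "X3": 5},
    "sample_B": {"X1": 2500, "X2": 2600, "X3": 0},
    "sample_C": {"X1": 10, "X2": 20, "X3": 1},
}


def filter_tally(table, min_sample_total=5000, min_column_total=1000):
    """Drop low-abundance samples first, then low-abundance columns."""
    # Step 1: keep only samples (rows) whose total count exceeds the cutoff.
    kept = {
        sample: counts
        for sample, counts in table.items()
        if sum(counts.values()) > min_sample_total
    }
    # Step 2: column totals are computed on the remaining samples only,
    # which is what colSums(tally.select) does in the R code.
    columns = sorted({col for counts in kept.values() for col in counts})
    keep_cols = [
        col
        for col in columns
        if sum(counts.get(col, 0) for counts in kept.values()) > min_column_total
    ]
    return {
        sample: {col: counts.get(col, 0) for col in keep_cols}
        for sample, counts in kept.items()
    }


tally_select = filter_tally(tally)
print(tally_select)
```

The order matters: a column that only clears the cutoff because of a bad library would survive if you filtered columns first.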

If your experiment is based on factors (i.e. you want to test for differences between categories of samples) you may want to use DESeq2, otherwise I suggest normalizing by sample abundance.

## normalize

tally.select <- tally.select/rowSums(tally.select)
unique.select <- unique.select/rowSums(unique.select)
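Normalizing by sample abundance just means dividing every count in a row by that row's total, so each sample sums to 1. A small Python sketch with invented counts (the helper name `normalize_rows` is mine, not part of paprica):

```python
def normalize_rows(table):
    """Convert per-sample counts to relative abundance (each row sums to 1)."""
    normalized = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        if total == 0:
            # An empty sample cannot be normalized; leave it as zeros.
            normalized[sample] = {col: 0.0 for col in counts}
        else:
            normalized[sample] = {col: n / total for col, n in counts.items()}
    return normalized


# Invented counts, for illustration only.
counts = {
    "sample_A": {"X1": 30, "X2": 10},
    "sample_B": {"X1": 5, "X2": 15},
}

relative = normalize_rows(counts)
print(relative)
```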

Now we’re going to do something tricky. For both unique.select and tally.select, rows are observations and columns are variables (edges or unique reads). Those likely don’t mean much to you unless you’re intimately familiar with the reference tree. We can map the edge numbers to taxa using the “taxa” dataframe, but first we need to remove the “X” added by R to make the numbers legal column names. For the unique read labels, we need to split on “_”, which divides the unique read identifier from the edge number.

## get edge numbers associated with columns, and map to taxa names.
## If the entry in taxon is empty it means the read could not be classified below
## the level of the domain Bacteria, and is labeled as "Bacteria"

tally.lab.Row <- taxa[colnames(tally.select), 'taxon']
tally.lab.Row[tally.lab.Row == ""] <- 'Bacteria'

unique.lab.Row <- map[colnames(unique.select), 'global_edge_num']
unique.lab.Row <- taxa[unique.lab.Row, 'taxon']
unique.lab.Row[unique.lab.Row == ""] <- 'Bacteria'
unique.lab.Row[is.na(unique.lab.Row)] <- 'Bacteria'
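The label handling described above (drop the leading “X”, split unique read labels on “_”, fall back to “Bacteria”) can be sketched in a few lines of Python. This is a sketch under assumptions: I'm assuming the edge number comes after the last underscore in a unique read label, and the taxon map here is a toy dictionary, not the real taxon_map file.

```python
def edge_from_column(name):
    """Strip the 'X' that R prepends to numeric column names: 'X1234' -> '1234'."""
    if name.startswith("X") and name[1:].isdigit():
        return name[1:]
    return name


def split_unique_label(label):
    """Split '<read id>_<edge number>' on the last underscore (assumed layout)."""
    read_id, _, edge = label.rpartition("_")
    return read_id, edge


def taxon_for_edge(edge, taxon_map):
    """Look up a taxon name; empty or missing entries are labeled 'Bacteria'."""
    taxon = taxon_map.get(edge, "")
    return taxon if taxon else "Bacteria"


# Toy taxon map keyed by edge number (invented entries).
taxon_map = {"1234": "Candidatus Pelagibacter", "88": ""}

print(taxon_for_edge(edge_from_column("X1234"), taxon_map))
print(taxon_for_edge(split_unique_label("read7_88")[1], taxon_map))
```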

In the above block of code I labeled the new variables as [tally|unique].lab.Row, because we’ll first use them to label the rows of a heatmap. Heatmaps are a great way to start getting familiar with your data.

## make a heatmap of edge abundance

heat.col <- colorRampPalette(c('white', 'lightgoldenrod1', 'darkgreen'))(100)

heatmap(t(data.matrix(tally.select)),
        scale = NULL, # NULL falls back to heatmap's default ('row'); use 'none' to plot the values unscaled
        col = heat.col,
        labRow = tally.lab.Row,
        margins = c(10, 10))

heatmap(t(data.matrix(unique.select)),
        scale = NULL, # NULL falls back to heatmap's default ('row'); use 'none' to plot the values unscaled
        col = heat.col,
        labRow = unique.lab.Row,
        margins = c(10, 10))

Heatmaps are great for visualizing broad trends in the data, but they aren’t a good entry point for quantitative analysis. A good next step is to carry out some kind of ordination (NMDS, PCoA, PCA, CA). Not all ordination methods will work well for all types of data. Here we’ll use correspondence analysis (CA) on the relative abundance of the unique reads. CA will be carried out with the package “ca”, while “factoextra” will be used to parse the CA output and calculate key additional information. You can find a nice in-depth tutorial on correspondence analysis in R here.

library(ca)
library(factoextra)

unique.select.ca <- ca(unique.select)
unique.select.ca.var <- get_eigenvalue(unique.select.ca)
unique.select.ca.res <- get_ca_col(unique.select.ca)

species.x <- unique.select.ca$colcoord[,1]
species.y <- unique.select.ca$colcoord[,2]

samples.x <- unique.select.ca$rowcoord[,1]
samples.y <- unique.select.ca$rowcoord[,2]

dim.1.var <- round(unique.select.ca.var$variance.percent[1], 1)
dim.2.var <- round(unique.select.ca.var$variance.percent[2], 1)

plot(species.x, species.y,
     ylab = paste0('Dim 2: ', dim.2.var, '%'),
     xlab = paste0('Dim 1: ', dim.1.var, '%'),
     pch = 3,
     col = 'red')

points(samples.x, samples.y,
       pch = 19)

legend('topleft',
       legend = c('Samples', 'Unique reads'),
       pch = c(19, 3),
       col = c('black', 'red'))
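The percentages on the axis labels are just each dimension's eigenvalue as a share of the total. Here is a minimal Python sketch of that arithmetic, with invented eigenvalues; it illustrates the calculation only and is not a reimplementation of the ca or factoextra packages.

```python
def variance_percent(eigenvalues):
    """Percent of total variance (inertia) carried by each dimension."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


# Invented eigenvalues, for illustration only.
eigenvalues = [0.5, 0.25, 0.25]
percent = variance_percent(eigenvalues)

# Build axis labels in the same style as the plot above.
labels = [f"Dim {i}: {round(p, 1)}%" for i, p in enumerate(percent, start=1)]
print(labels[:2])
```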

At this point you’re ready to crack open the unique.select.ca object and start doing some hypothesis testing. There’s one more visualization, however, that can help with initial interpretation: a heatmap of the top unique reads contributing to the first two dimensions (which account for nearly all of the variance between samples).

species.contr <- unique.select.ca.res$contrib[,1:2]
species.contr.ordered <- species.contr[order(rowSums(species.contr), decreasing = T),]
species.contr.top <- species.contr.ordered[1:10,]

species.contr.lab <- unique.lab.Row[order(rowSums(abs(species.contr)), decreasing = T)]

heatmap(species.contr.top,
        scale = 'none',
        col = heat.col,
        Colv = NA,
        margins = c(10, 20),
        labRow = species.contr.lab[1:10],
        labCol = c('Dim 1', 'Dim 2'),
        cexCol = 1.5)
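The ranking step above boils down to summing each read's contribution across the two dimensions and keeping the largest. A small Python sketch with invented contribution values (`top_contributors` is a hypothetical helper):

```python
def top_contributors(contrib, n=10):
    """Rank reads by summed contribution to the plotted dimensions."""
    ranked = sorted(contrib.items(), key=lambda item: sum(item[1]), reverse=True)
    return ranked[:n]


# Invented contributions (Dim 1, Dim 2) for four unique reads.
contrib = {
    "read_a": (12.0, 1.0),
    "read_b": (3.0, 20.0),
    "read_c": (0.5, 0.2),
    "read_d": (8.0, 8.0),
}

top = top_contributors(contrib, n=2)
print(top)
```

Keeping the label and the values together while sorting, as the dictionary does here, avoids the risk of the labels and rows drifting out of step when they are ordered separately.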

From this plot we see that quite a few different taxa are contributing approximately equally to Dim 1 (which accounts for much of the variance between samples), including several different Pelagibacter and Rhodobacteraceae strains. That makes sense as the dominant environmental gradient in the study was inside vs. outside of San Diego Bay and we would expect these strains to be organized along such a gradient. Dim 2 is different, with unique reads associated with Tropheryma whipplei and Rhodoluna lacicola contributing most. These aren’t typical marine strains, and if we look back at the original data we see that these taxa are very abundant in just two samples. These samples are the obvious outliers along Dim 2 in the CA plot.

In this tutorial we covered just the community structure output from paprica, but of course the real benefit to using paprica is its estimation of metabolic potential. These data are found in the *.ec_tally.csv and *path_tally.csv files, and organized in the same way as the edge and unique read abundance tables. Because of this they can be plotted and analyzed in the same way.

Separating bacterial and archaeal reads for analysis with paprica

Jeff — Fri, 07 Sep 2018 13:50:31 +0000

One of the most popular primer sets for 16S rRNA gene amplicon analysis right now is the 515F/806R set. One of the advantages of this pair is that it amplifies broadly across the domains Archaea and Bacteria. This reduces by half the amount of work required to characterize prokaryotic community structure, and allows a comparison of the relative (or absolute, if you have counts) abundance of bacteria and archaea. However, paprica and many other analysis tools aren’t designed to simultaneously analyze reads from both domains. Different reference alignments or covariance models, for example, might be required. Thus it’s useful to split an input fasta file into separate bacterial and archaeal files.

We like to use the Infernal tool cmscan for this purpose. First, you’ll need to acquire covariance models for the 16S/18S rRNA genes from all three domains of life. You can find those on the Rfam website; they are also included in paprica/models if you’ve downloaded paprica. Copy the models to a new subdirectory in your working directory while combining them into a single file:

mkdir cms
cat ../paprica/models/*cm > cms/all_domains.cm
cd cms

Now you need to compress and index the covariance models using the cmpress utility provided by Infernal. This takes a while.

cmpress all_domains.cm

Pretty simple. Now you’re ready to do some work. The whole Infernal suite of tools has pretty awesome built-in parallelization, but with only three covariance models in the file you won’t get much out of it. Best to minimize cmscan’s use of cores and instead push lots of files through it at once. This is easily done with the GNU Parallel command:

## run this from the working directory that holds your fasta files and the cms subdirectory
ls *.fasta | parallel -u cmscan --cpu 1 --tblout {}.txt cms/all_domains.cm {} > /dev/null

Next comes the secret sauce. The command above produces an easy-to-parse, easy-to-read table with classification stats for each of the covariance models that we searched against. Paprica contains a utility in paprica/utilities/pick_16S_domain.py to parse the table and figure out which model scored best for each read, then make three new fasta files, one for each of the domains Bacteria, Archaea, and Eukarya (the primers will typically pick up a few euks). We’ll parallelize the script just as we did for cmscan.

ls *.fasta | parallel -u python pick_16S_domain_2.py -prefix {} -out {}
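To give a feel for what "figure out which model scored best for each read" involves, here is a minimal Python sketch. It is not the code in paprica's utility. I'm assuming the default cmscan --tblout layout, where the model name is the first whitespace-separated field, the read name the third, and the bit score the fifteenth. The three example lines are invented.

```python
def best_model_per_read(tblout_lines):
    """Return {read name: best-scoring model} from cmscan --tblout text lines."""
    best = {}
    for line in tblout_lines:
        # Skip blank lines and the '#' header/footer lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 16:
            continue  # not a complete hit line
        model, read, score = fields[0], fields[2], float(fields[14])
        # Keep the hit only if it beats the best score seen so far for this read.
        if read not in best or score > best[read][1]:
            best[read] = (model, score)
    return {read: model for read, (model, _) in best.items()}


# Invented hit lines in the assumed column layout.
example = [
    "#target name accession query name accession mdl ...",
    "SSU_rRNA_bacteria RF00177 read1 - cm 1 250 1 253 + no 1 0.55 0.0 210.5 1e-60 ! -",
    "SSU_rRNA_archaea RF01959 read1 - cm 1 250 1 253 + no 1 0.55 0.0 150.2 1e-40 ! -",
    "SSU_rRNA_archaea RF01959 read2 - cm 1 250 1 253 + no 1 0.48 0.0 190.0 1e-55 ! -",
]

domains = best_model_per_read(example)
print(domains)
```

From a dictionary like this, writing each read to a per-domain fasta file is a simple lookup while iterating over the input sequences.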

Now you have domain-specific files that you can analyze in paprica or your amplicon analysis tool of choice!

paprica v0.4.0

Jeff — Sun, 08 Jan 2017 20:04:40 +0000

I’m happy to announce the release of paprica v0.4.0. This release adds a number of new features to our pipeline for evaluating microbial community and metabolic structure. These include:

NCBI taxonomy information for each point of placement on the reference tree, including internal nodes.
Inclusion of the domain Eukarya. This was a bit tricky and requires some further explanation.

The distribution of metabolic pathways, predicted during the creation of the paprica Eukarya database, across transcriptomes in the MMETSP.

Eukaryotic genomes are a totally different beast than their archaeal and bacterial counterparts. First and foremost they are massive. Because of this there aren’t very many completed eukaryotic genomes out there, particularly for single-celled eukaryotes. While a single investigator can now sequence, assemble, and annotate a bacterial or archaeal genome in very little time, eukaryotic genomes still require major efforts by consortia and lots of $$.

One way to get around this scale problem is to focus on eukaryotic transcriptomes instead of genomes. Because much of the eukaryotic genome is noncoding, this greatly reduces sequencing volume. Since there is no such thing as a contiguous transcriptome, this approach also implies that no assembly (beyond open reading frames) will be attempted. The Moore Foundation-funded Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) was an initial effort to use this approach to address the problem of unknown eukaryotic genetic diversity. The MMETSP sequenced transcriptomes from several hundred different strains. The taxonomic breadth of the strains sequenced is pretty good, even if (predictably) the taxonomic resolution is not. Thus, as for archaea, the phylogenetic tree and metabolic inferences should be treated with caution. For eukaryotes there are the additional caveats that 1) not all genes coded in a genome will be represented in the transcriptome, 2) the database contains only strains from the marine environment, and 3) eukaryotic 18S trees are kind of messy. Considerable effort went into making a decent tree, but you’ve been warned.

Because the underlying data is in a different format, not all genome parameters are calculated for the eukaryotes. 18S gene copy number is not determined (and thus community and metabolic structure are not normalized), and the phi parameter, GC content, etc. are also not calculated. However, eukaryotic community structure is evaluated and metabolic structure inferred in the same way as for the domains Bacteria and Archaea:

./paprica-run.sh test.eukarya eukarya

As always you can install paprica v0.4.0 by following the instructions here, or you can use the virtual box or Amazon Web Service machine instance.

paprica on the cloud

Jeff — Sat, 26 Nov 2016 03:15:16 +0000

This is a quick post to announce that paprica, our pipeline to evaluate community structure and conduct metabolic inference, is now available on the cloud as an Amazon Machine Image (AMI). The AMI comes with all dependencies required to execute the paprica-run.sh script pre-installed. If you want to use it for paprica-build.sh you’ll have to install pathway-tools and a few additional dependencies. I’m new to the Amazon EC2 environment, so please let me know if you have any issues using the AMI.

If you are new to Amazon Web Services (AWS) the basic way this works is:

Sign up for Amazon EC2 using your normal Amazon log-in
From the AWS console, make sure that your region is N. Virginia (community AMIs are only available in the region they were created in)
From your EC2 dashboard, scroll down to “Create Instance” and click “Launch Instance”
Now select the “Community AMIs”
Search for “paprica-ec2”, then select the AMI corresponding to the latest version of paprica (0.4.0 at the time of writing).
Choose the type of instance you would like to run the AMI on. This is the real power of AWS; you can tailor the instance to the analysis you would like to run. For testing choose the free t2.micro instance. This is sufficient to execute the test files or run a small analysis (hundreds of reads). To use paprica’s parallel features select an instance with the desired number of cores and sufficient memory.
Click “Review and Launch”, and finish setting up the instance as appropriate.
Log onto the instance, navigate to the paprica directory, execute the test file(s) as described in the paprica tutorial. The AMI is not updated as often as paprica, so you may wish to reclone the github repository, or download the latest stable release.

Exploring genome content and genomic character with paprica and R

Jeff — Thu, 07 Jul 2016 17:07:24 +0000

The paprica pipeline was designed to infer the genomic content and genomic characteristics of a set of 16S rRNA gene reads. To enable this the paprica database organizes this information by phylogeny for many of the completed genomes in Genbank. In addition to metabolic inference this provides an opportunity to explore how genome content and genomic characteristics are organized phylogenetically. The following is a brief analysis of some genomic features using the paprica database and R. If you aren’t familiar with the paprica database this exercise will also familiarize you with some of its content and its organization.

The paprica pipeline and database can be obtained from Github here. In this post I’ll be using the database associated with version 0.3.1. The necessary files from the bacteria database (one could also conduct this analysis on the much smaller archaeal database) can be read into R as such:

## Read in the pathways associated with the terminal nodes on the reference tree
path <- read.csv('paprica/ref_genome_database/bacteria/terminal_paths.csv', row.names = 1)
path[is.na(path)] <- 0

## Read in the data associated with all completed genomes in Genbank
data <- read.csv('paprica/ref_genome_database/bacteria/genome_data.final.csv', row.names = 1)

## During database creation genomes with duplicate 16S rRNA genes were removed,
## so limit to those that were retained
data <- data[row.names(data) %in% row.names(path),]

## "path" is ordered by clade, meaning it is in top to bottom order of the reference tree,
## however, "data" is not, so order it here
data <- data[order(data$clade),]

One fun thing to do at this point is to look at the distribution of metabolic pathways across the database. To develop a sensible view it is best to cluster the pathways according to which genomes they are found in.

## The pathway file in the database is binary, so we use Jaccard for distance
library('vegan')
path.dist <- vegdist(t(path), method = 'jaccard') # distance between pathways (not samples!)
path.clust <- hclust(path.dist)
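For presence/absence data the Jaccard distance between two pathways is one minus the number of genomes that have both, divided by the number that have at least one. A minimal Python sketch of that calculation on invented presence/absence vectors (this shows the distance itself, not vegan's implementation or the clustering step):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length presence/absence vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        # Neither pathway occurs anywhere; treating that as distance 0
        # is a convention chosen for this sketch.
        return 0.0
    return 1.0 - both / either


# Invented presence (1) / absence (0) of two pathways across four genomes.
pathway_1 = [1, 1, 0, 0]
pathway_2 = [1, 0, 1, 0]

print(jaccard_distance(pathway_1, pathway_2))
```

Shared absences do not count toward similarity, which is why Jaccard suits a sparse binary matrix like this one: two rare pathways are not considered alike just because most genomes lack both.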

The heatmap function is a bit cumbersome for this large matrix, so the visualization can be made using the image function.

## Set a binary color scheme
image.col <- colorRampPalette(c('white', 'blue'))(2)

## Image will order matrix in ascending order, which is not what we want here!
image(t(data.matrix(path))[rev(path.clust$order),length(row.names(path)):1],
      col = image.col,
      ylab = 'Genome',
      xlab = 'Pathway',
      xaxt = 'n',
      yaxt = 'n')

box()

The distribution of metabolic pathways across all 3,036 genomes in the v0.3.1 paprica database.

There are a couple of interesting things to note in this plot. First, we can see the expected distribution of core pathways present in nearly all genomes, and the interesting clusters of pathways that are unique to a specific lineage. For clarity row names have been omitted from the above plot, but from within R you can pull out the taxa or pathways that interest you easily enough. Second, there are some genomes that have very few pathways. There are a couple of possible reasons for this that can be explored with a little more effort. One possibility is that these are poorly annotated genomes, or at least the annotation didn’t associate many or any coding sequences with either EC numbers or GO terms – the two pieces of information Pathway-Tools uses to predict pathways during database construction. Another possibility is that these genomes belong to obligate symbionts (either parasites or beneficial symbionts). Obligate symbionts often have highly streamlined genomes and few complete pathways. We can compare the number of pathways in each genome to other genome characteristics for additional clues.

A reasonable assumption is that the number of pathways in each genome should scale with the size of the genome. Large genomes with few predicted pathways might indicate places where the annotation isn’t compatible with the pathway prediction methods.

## Plot the number of pathways as a function of genome size
plot(rowSums(path) ~ data$genome_size,
     ylab = 'nPaths',
     xlab = 'Genome size')

## Plot P. ubique as a reference point
select <- grep('Pelagibacter ubique HTCC1062', data$organism_name)
points(rowSums(path)[select] ~ data$genome_size[select],
       pch = 19,
       col = 'red')

The number of metabolic pathways predicted as a function of genome size for the genomes in the paprica database.

That looks pretty good. For the most part more metabolic pathways were predicted for larger genomes, however, there are some exceptions. The red point gives the location of Pelagibacter ubique HTCC1062. This marine bacterium is optimized for life under energy-limited conditions. Among its adaptations are a highly efficient and streamlined genome. In fact it has the smallest genome of any known free-living bacterium. All the points below it on the x-axis are obligate symbionts; these are dependent on their host for some of their metabolic needs. There are a few larger genomes that have very few (or even no) pathways predicted. These are the genomes with bad, or at least incompatible annotations (or particularly peculiar biochemistry).

The other genome parameters in paprica are the number of coding sequences identified (nCDS), the number of genetic elements (nge), the number of 16S rRNA gene copies (n16S), GC content (GC), and phi, a measure of genomic plasticity. We can make another plot to show the distribution of these parameters with respect to phylogeny.

## Grab only the data columns we want
data.select <- data[,c('n16S', 'nge', 'ncds', 'genome_size', 'phi', 'GC')]

## Make the units somewhat comparable on the same scale, a more
## careful approach would log-normalize some of the units first
data.select.norm <- decostand(data.select, method = 'standardize')
data.select.norm <- decostand(data.select.norm, method = 'range')

## Plot with a heatmap
heat.col <- colorRampPalette(c('blue', 'white', 'red'))(100)
heatmap(data.matrix(data.select.norm),
      margins = c(10, 20),
      col = heat.col,
      Rowv = NA,
      Colv = NA,
      scale = NULL, # NULL falls back to heatmap's default ('row'); use 'none' to plot the values unscaled
      labRow = 'n',
      cexCol = 0.8)
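The two scaling steps above are, per column, a z-score followed by a rescale to the 0 to 1 range. Here is a small Python sketch on one invented column, assuming (as I understand decostand's defaults for these methods) that both operate column by column and that the standard deviation is the sample standard deviation.

```python
from statistics import mean, stdev


def standardize(column):
    """Center to zero mean and scale to unit (sample) standard deviation."""
    m, s = mean(column), stdev(column)
    if s == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(x - m) / s for x in column]


def range_scale(column):
    """Rescale so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Invented values for a single genome parameter.
column = [1.0, 2.0, 4.0]

two_step = range_scale(standardize(column))
one_step = range_scale(column)
print(two_step)
print(one_step)
```

Because range scaling is unchanged by shifting and positive rescaling, the two-step result matches range scaling alone up to floating point error. What puts the columns on a common scale for the heatmap is the range step.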

Genomic parameters organized by phylogeny.

Squinting at this plot it looks like GC content and phi are potentially negatively correlated, which could be quite interesting. These two parameters can be plotted to get a better view:

plot(data.select$phi ~ data.select$GC,
     xlab = 'GC',
     ylab = 'phi')

The phi parameter of genomic plasticity as a function of GC content.

Okay, not so much… but I think the pattern here is pretty interesting. Above a GC content of 50 % there appears to be no relationship, but these parameters do seem correlated for low GC genomes. This can be evaluated with linear models for genomes above and below 50 % GC.

gc.phi.above50 <- lm(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)])
gc.phi.below50 <- lm(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)])

summary(gc.phi.above50)
summary(gc.phi.below50)

plot(data.select$phi ~ data.select$GC,
     xlab = 'GC',
     ylab = 'phi',
     type = 'n')

points(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)],
       col = 'blue')

points(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)],
       col = 'red')

abline(gc.phi.above50,
       col = 'blue')

abline(gc.phi.below50,
       col = 'red')

legend('bottomleft',
       bg = 'white',
       legend = c('GC >= 50',
                  'GC < 50'),
       col = c('blue', 'red'),
       pch = 1)
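For anyone curious what the two linear models are doing under the hood, here is a minimal Python sketch of an ordinary least squares fit and its R², applied separately to points below and above a GC cutoff of 50. The GC and phi values are invented for illustration and are not taken from the paprica database.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Invented (GC, phi) pairs.
gc = [30, 35, 40, 45, 55, 60, 65, 70]
phi = [0.9, 0.8, 0.7, 0.6, 0.5, 0.6, 0.5, 0.6]

# Split the points at the 50 % GC cutoff, as in the R code.
below = [(g, p) for g, p in zip(gc, phi) if g < 50]
above = [(g, p) for g, p in zip(gc, phi) if g >= 50]

low_slope, low_intercept, low_r2 = linear_fit(*zip(*below))
high_slope, high_intercept, high_r2 = linear_fit(*zip(*above))
print(low_slope, low_r2)
print(high_slope, high_r2)
```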

Genomic plasticity (phi) as a function of GC content for all bacterial genomes in the paprica database.

As expected there is no correlation between genomic plasticity and GC content for the high GC genomes (R² = 0) and a highly significant correlation for the low GC genomes (albeit with weak predictive power; R² = 0.106, p = 0). So what’s going on here? Low GC content is associated with particular microbial lineages but also with certain ecologies. The free-living low-energy specialist P. ubique HTCC1062 has a low GC content genome for example, as do many obligate symbionts regardless of their taxonomy (I don’t recall if it is known why this is). Both groups are associated with a high degree of genomic modification, including genome streamlining and horizontal gene transfer.

Tutorial: Annotating metagenomes with paprica-mg

Jeff — Sat, 26 Mar 2016 07:00:46 +0000

This tutorial is both a work in progress and a living document. If you see an error, or want something added, please let me know by leaving a comment.

Starting with version 0.3.0 paprica contains a metagenomic annotation module. This module takes as input a fasta or fasta.gz file containing the QC’d reads from a shotgun metagenome and uses DIAMOND Blastx to classify these reads against a database derived from the paprica database. The module produces as output:

Classification for each read in the form of an EC number (obviously this applies only to reads associated with genes coding for enzymes).
A tally of the occurrence of each EC number in the sample, with some useful supporting information.
Optionally, the metabolic pathways likely to be present within the sample.

In addition to the normal paprica-run.sh dependencies paprica-mg requires DIAMOND Blastx. Follow the instructions in the DIAMOND manual, and be sure to add the location of the DIAMOND executables to your PATH. If you want to predict metabolic pathways on your metagenome you will need to also download the pathway-tools software. See the notes here.

Obtain the paprica-mg database

There are two ways to obtain the paprica-mg database. You can obtain a pre-made version of the database by downloading the files paprica-mg.dmnd and paprica-mg.ec.csv.gz (large!) to the ref_genome_database directory in the paprica directory. Be sure to gunzip paprica-mg.ec.csv.gz before continuing.

## Navigate to wherever you installed the paprica database
cd ~/paprica/ref_genome_database

## Download the tarball that contains the paprica-mgt database
wget https://www.polarmicrobes.org/extras/paprica-mgt.database.tgz

## Untar
tar -xzvf paprica-mgt.database.tgz

Alternatively, if you wish to build the paprica-mg database from scratch, perhaps because you’ve customized that database or are building it more frequently than the release cycle, you will need to first build the regular paprica database. Then build the paprica-mgt database as such:

paprica-mgt_build.py -ref_dir ref_genome_database

If you’ve set paprica up in the standard way you can execute this command from anywhere on your system; the paprica directory is already in your PATH, and the script will look for the directory “ref_genome_database” relative to itself. Likewise you don’t need to be in the paprica directory to execute the paprica-mg_run.py script.

Annotate a metagenome

Once you’ve downloaded or built the database you can run your analysis. It is worth spending a little time with the DIAMOND manual and considering the parameters of your system. To try things out you can download a “smallish” QC’d metagenome from the Tara Oceans Expedition (selected randomly for convenient size):

## Download a test metagenome
wget https://www.polarmicrobes.org/extras/ERR318619_1.qc.fasta.gz

## Execute paprica-mg for EC annotation only
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways F

This will produce the following output:

test.annotation.csv: The number of hits in the metagenome, by EC number. See the paprica manual for a complete explanation of columns.

test.paprica-mg.nr.daa: The DIAMOND format results file. Only one hit per read is reported.

test.paprica-mg.nr.txt: A text file of the DIAMOND results. Only one hit per read is reported.
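To make the "one hit per read, then tally by EC number" idea concrete, here is a minimal Python sketch. It assumes hits in DIAMOND's default 12-column tabular layout (read name first, subject name second, bit score last) and a lookup from subject name to EC number. The hits and the `ec_map` dictionary are invented, and the real paprica-mg files may be organized differently.

```python
from collections import Counter


def best_hit_per_read(hit_lines):
    """Keep the highest bit score hit for each read from tab-separated hit lines."""
    best = {}
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed lines
        read, subject, bitscore = fields[0], fields[1], float(fields[11])
        if read not in best or bitscore > best[read][1]:
            best[read] = (subject, bitscore)
    return {read: subject for read, (subject, _) in best.items()}


def tally_ec(best_hits, ec_map):
    """Count reads per EC number; hits without an EC annotation are skipped."""
    return Counter(ec_map[s] for s in best_hits.values() if s in ec_map)


# Invented hits: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
hits = [
    "read1\tprotA\t90.0\t50\t5\t0\t1\t150\t10\t59\t1e-20\t95.0",
    "read1\tprotB\t80.0\t50\t10\t0\t1\t150\t10\t59\t1e-10\t70.0",
    "read2\tprotB\t85.0\t48\t7\t0\t1\t144\t3\t50\t1e-15\t80.0",
    "read3\tprotC\t70.0\t45\t13\t0\t1\t135\t5\t49\t1e-8\t55.0",
]

# Hypothetical subject-to-EC lookup.
ec_map = {"protA": "1.1.1.1", "protB": "2.7.7.7"}

ec_tally = tally_ec(best_hit_per_read(hits), ec_map)
print(ec_tally)
```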

Predicting pathways on a metagenome is very time intensive and it isn’t clear what the “correct” way is to do this. I’ve tried to balance speed with accuracy in paprica-mg. If you execute with -pathways T, DIAMOND is executed twice; once for the EC number annotation as above (reporting only a single hit for each read), and once to collect data for pathway prediction. On that search DIAMOND reports as many hits for each read as there are genomes in the paprica database. Of course most reads will have far fewer (if any) hits. The reason for this approach is to try and reconstruct as best as possible the actual genomes that are present. For example, let’s say that a given read has a significant hit to an enzyme in genome A and genome B. When aggregating information for pathway-tools the enzyme in genome A and genome B will be presented to pathway-tools in separate Genbank files representing separate genetic elements. Because a missing enzyme in either genome A or genome B could cause a negative prediction for the pathway, we want the best chance of capturing the whole pathway. So a second enzyme, critical to the prediction of that pathway, might get predicted for only genome A or genome B. The idea is that the incomplete pathways will get washed out at the end of the analysis, and since pathway prediction is by its nature non-redundant (each pathway can only be predicted once) over-prediction is minimized. To predict pathways during annotation:

## execute paprica-mg for EC annotation and pathway prediction
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways T -pgdb_dir /location/of/ptools-local

In addition to the files already mentioned, you will see:

test_mg.pathologic: a directory containing all the information that pathway-tools needs for pathway prediction.

test.pathways.txt: A simple text file of all the pathways that were predicted.

test.paprica-mg.txt: A very large text file of all hits for each read. You probably want to delete this right away to save space.

test.paprica-mg.daa: A very large DIAMOND results file of all hits for each read. You probably want to delete this right away to save space.

testcyc: A directory in ptools-local/pgdbs/local containing the PGDB and prediction reports. It is worth spending some time here, and interacting with the PGDB using the pathway-tools GUI.

Tutorial: Building the paprica database

Jeff — Fri, 11 Mar 2016 16:04:12 +0000

This tutorial is both a work in progress and a living document. If you see an error, or want something added, please let me know by leaving a comment.

Building the paprica database provides maximum flexibility but involves more moving parts and resources than using the provided database. Basic instructions for using the paprica-build.sh script are provided in the manual; this tutorial is intended to provide an even more detailed step-by-step guide.

Requirements

While a laptop running Linux, Windows with VirtualBox, or OSX is perfectly adequate for analysis with paprica, you’ll need something a little beefier for building the database (unless you’re really patient). A high performance cluster is overkill; I build the provided database on a basic 12 core Linux workstation with 32 GB RAM (< $5k). Something in this ballpark should work fine. Of course more cores will get the job done faster (but keep an eye on memory usage).

Once you’ve got the hardware requirements sorted out you need to download the dependencies. I recommend first following all the instructions for the paprica-run.sh script, then installing RAxML, pathway-tools, and taxtastic. The rest of this tutorial assumes you’ve done just that, including running the test.bacteria.fasta file against the bacteria database:

./paprica-run.sh test.bacteria bacteria

Install Remaining Dependencies

In addition to all the dependencies required by paprica-run.sh, you need pathway-tools and RAxML. These are very mainstream programs, but that doesn’t necessarily mean installation is easy. In particular pathway-tools requires that you request a license (free for academic users). This takes about 24 hours, after which you’ll receive a link to download the installer. Regardless of whether you’re sitting at the workstation or accessing it via SSH, a GUI will pop up and guide you through the installation. In general you can accept the defaults; however, the GUI will ask you where pathway-tools should create the ptools-local directory. This is where the program will create the pathway-genome databases that describe (among other things) the metabolic pathways in each genome. By the time you are done creating the database this directory will be > 100 GB, so pick a location with plenty of space! This might not be your home directory (the default location). For example on my system my home directory is housed on a small SSD. To keep the home directory from becoming bloated I opted to locate ptools-local on a separate drive.

You will receive a number of download options from the pathway-tools development team. I recommend that you conduct only the basic installation of pathway-tools (this is the EcoCyc and MetaCyc option), and do not download and install additional PGDBs. Nothing wrong with installing these additional, well-curated PGDBs other than increased space and time, but they become ponderous. You can always add them later if you want to become a metabolic modeling rock star.

Once you’ve installed pathway-tools you should be sure to add the program to your PATH, following standard methods. Once you’ve done this re-source .profile and type pathway-tools in a bash shell. The GUI should open.

RAxML is the one piece of software used by paprica that requires compilation. Fortunately the RAxML folks generally build good software, so compiling isn’t likely to be a problem. RAxML comes in several flavors and paprica is a bit particular about the version of RAxML it expects to find. RAxML gets called in two scripts; paprica-get_ref.py and paprica-place_it.py. These scripts call “raxmlHPC-PTHREADS-AVX2”, so you need to make sure you build the parallel threaded AVX2 version, i.e.:

make -f Makefile.AVX2.PTHREADS.gcc

If you can’t build that version because of hardware limitations (a very old workstation or some such) you’ll need to either modify those scripts accordingly or rename the RAxML executable that does work to raxmlHPC-PTHREADS-AVX2 (I cringe at this suggestion, but it is the simplest solution). Get in contact with me if you have issues. As with all dependencies, after you build raxmlHPC-PTHREADS-AVX2 you need to add the RAxML directory to your PATH, re-source .profile, and test the installation by typing raxmlHPC-PTHREADS-AVX2 in a bash shell. If RAxML yells at you with a warning, it installed correctly and you’re good to go.

Test paprica-build.sh

There are a lot of moving parts in paprica-build.sh. Because of this, and because of the amount of time certain steps take to complete, troubleshooting can be a little frustrating. To make things easier you can download a fake ref_genome_database directory here with 11 genomes pre-loaded (10 in the bacteria/refseq and 1 in user/bacteria). Download, remove the old ref_genome_database directory that came with paprica, and untar the new one:

rm -r ref_genome_database
wget https://www.polarmicrobes.org/extras/ref_genome_database.tgz
tar -xzvf ref_genome_database.tgz

At this point in time it is absolutely essential that you open paprica-build.sh and switch the -download flag in the paprica-make_ref.py line from “T” to “test”. If you don’t do this you will initiate a fresh download of all the completed genomes in Genbank. Make sure you switch this flag back to T when you are done testing and ready to actually build the database.
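If you’d rather not edit the flag by hand, here is a minimal Python sketch of the switch. It assumes the flag appears as “-download T” (whitespace separated) on the line that calls paprica-make_ref.py; the example line below is hypothetical, so check it against your own copy of paprica-build.sh before trusting it.

```python
import re


def set_download_flag(script_text, value):
    """Return script_text with the -download value changed, but only on
    lines that call paprica-make_ref.py. Assumes '-download <value>'."""
    out = []
    for line in script_text.splitlines(keepends=True):
        if "paprica-make_ref.py" in line:
            # Replace whatever token follows -download with the new value.
            line = re.sub(r"(-download\s+)\S+",
                          lambda m: m.group(1) + value, line)
        out.append(line)
    return "".join(out)


# Hypothetical line for illustration; the real line may carry other flags.
example = "python3 paprica-make_ref.py -download T -domain bacteria\n"
print(set_download_flag(example, "test"), end="")

# To apply it to the real script (back up the file first):
# with open("paprica-build.sh") as fh:
#     text = fh.read()
# with open("paprica-build.sh", "w") as fh:
#     fh.write(set_download_flag(text, "test"))
```

Calling the function again with "T" switches the flag back when you are done testing.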

Once you’ve downloaded the test ref_genome_database directory and switched the flag simply execute the script:

./paprica-build.sh bacteria

You’ll see lots of output flash by on the screen. Keep an eye out for error messages that could indicate something amiss. If something goes wrong the error messages should be quite obvious as the later steps will fail completely. I have noticed that for reasons which are not yet clear, when building the full database the script sometimes hangs after PGDB creation. If that happens use control-c to exit the bash script, and re-execute the paprica-build_core_genomes.py script. You can re-execute the whole bash script if you like, but it takes some time and isn’t necessary. If you start seeing messages like:

collecting data for internal node 13, 1 of 9
collecting data for internal node 19, 2 of 9
collecting data for internal node 16, 3 of 9
collecting data for internal node 6, 4 of 9
collecting data for internal node 2, 5 of 9
collecting data for internal node 5, 6 of 9
collecting data for internal node 15, 7 of 9
collecting data for internal node 9, 8 of 9
collecting data for internal node 14, 9 of 9

…you can relax, you’re at the end and nothing went amiss. This small collection of genomes includes the CCG for the reads in test.bacteria.fasta, so you can even test your new “database” with paprica-run.sh:

./paprica-run.sh test.bacteria bacteria

Add Custom Draft Genomes (Optional)

V0.3.0 of paprica introduced the ability to add custom draft genomes. These genomes add additional CCGs to your reference tree and increase the accuracy of the metabolic inference. Because they are not necessarily complete however, they are not used to calculate genome parameters such as the number of 16S rRNA genes or the length of the genome. Placements to these edges should produce NA values in the edge_data.csv file produced by paprica-run.sh. If you don’t want to add custom draft genomes you don’t need to do anything at this point. If you want to add draft genomes you should create a unique directory for each, based on accession number, in ref_genome_database/user/bacteria or ref_genome_database/user/archaea (as appropriate). For example, to add several draft Sulfitobacter genomes my directory structure looks like this:

me@computer:/home/me/paprica/ref_genome_database/user/bacteria$ ls
draft.combined_16S.fasta  GCF_000152645.1  GCF_000622325.1  GCF_000622365.1  GCF_000622405.1  GCF_000647675.1  GCF_000735125.1
GCF_000152605.1           GCF_000620505.1  GCF_000622345.1  GCF_000622385.1  GCF_000622425.1  GCF_000712315.1

Inside each of these directories are the .fna file for the draft genome and the genomic .gbff file (use that extension, not .gbk). So like this:

me@computer:/home/me/paprica/ref_genome_database/user/bacteria/GCF_000152645.1$ ls
GCF_000152645.1_ASM15264v1_genomic.fna  GCF_000152645.1_ASM15264v1_genomic.gbff

When you build the database paprica-make_ref.py will automatically look in the user directory and attempt to use anything it finds there. Your draft genome must have a 16S rRNA gene, which not all of them do. If it does not the build will continue without your draft genome.
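Before starting a long build it can be worth checking that each draft genome directory has the files described above. The following is only a sketch of such a check; the function name is mine and not part of paprica, and it does not test for the presence of a 16S rRNA gene.

```python
from pathlib import Path


def check_draft_genomes(user_dir):
    """Map each accession directory under user_dir to a list of problems
    (an empty list means the expected .fna and .gbff files were found)."""
    problems = {}
    for d in sorted(Path(user_dir).iterdir()):
        if not d.is_dir():
            continue  # skip loose files such as draft.combined_16S.fasta
        issues = []
        if not list(d.glob("*.fna")):
            issues.append("no .fna file")
        if not list(d.glob("*.gbff")):
            issues.append("no .gbff file")
        if list(d.glob("*.gbk")):
            issues.append("found .gbk; paprica expects .gbff")
        problems[d.name] = issues
    return problems


# Example use, assuming the directory layout shown above:
# for accession, issues in check_draft_genomes(
#         "ref_genome_database/user/bacteria").items():
#     print(accession, "OK" if not issues else "; ".join(issues))
```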

Add Custom EC Numbers

If EC numbers that you’re pretty sure should be there are not reported for a genome (annotations aren’t perfect) you can let paprica know about this by modifying the user_ec.csv file in ref_genome_database/user. Open the file in a text editor and you should see this:

##    This file can be used to indicate enzymes by EC number that you
##    believe should be in a genome, but do not appear in the genome.
##    These EC numbers will propagate to internal nodes, but will not
##    be used in pathway prediction.  The first row of this file is a
##    header and should not be modified.  Rows should be numbered
##    sequentially.
##
##    Example rows:
##    1,GCF12345,1.1.1.1,happy enzyme
##    2,GCF54321,2.3.4.5,sad enzyme
##
,GI_number,EC_number,product

Follow the instructions to add new enzymes to your database. These enzymes will propagate to internal nodes every time you update your database, but they will not be used in pathway prediction.
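As a rough illustration of the row format, the sketch below builds the next sequentially numbered row from the existing contents of user_ec.csv. It assumes comment lines start with “##”, the header is the line starting with a comma, and numbering starts at 1 as in the example rows; the function name is hypothetical.

```python
import csv
import io


def next_user_ec_row(lines, genome, ec_number, product):
    """Return the next row for user_ec.csv as a string, numbered one higher
    than the count of existing data rows."""
    n_rows = 0
    for line in lines:
        line = line.strip()
        # Skip blanks, '##' comments, and the header (which starts with ',').
        if not line or line.startswith("##") or line.startswith(","):
            continue
        n_rows += 1
    buf = io.StringIO()
    # csv.writer quotes the product name if it happens to contain a comma.
    csv.writer(buf, lineterminator="\n").writerow(
        [n_rows + 1, genome, ec_number, product])
    return buf.getvalue().rstrip("\n")


existing = [
    "##    Example rows:",
    "##    1,GCF12345,1.1.1.1,happy enzyme",
    ",GI_number,EC_number,product",
]
print(next_user_ec_row(existing, "GCF12345", "1.1.1.1", "happy enzyme"))
```

Append the returned string as a new line at the end of the file, after the header.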

Build the Database

If you tested paprica-build.sh as described above (you should!) make sure you switch the -download flag back to T. This will tell paprica to initiate a fresh download of genomes from Genbank. Don’t worry about the ref_genome_database directory you used for testing, it is fully compatible with the new download. Once you’ve done this initiate a build of the real database:

./paprica-build.sh bacteria

Of course if you wanted to build the archaeal database you would replace “bacteria” with “archaea”.

Cautionary Notes

One of the particularly time consuming steps in paprica-build.sh is pathway prediction by pathway-tools. In addition to taking some time (it has a lot of work to do) pathway-tools needs to send you graphical messages. You can ignore these, but if you close the SSH session progress will stop because pathway-tools has no place to send the messages.

Correctly evaluating metabolic inference methods

Jeff — Fri, 04 Mar 2016 16:47:08 +0000

Last week I gave a talk at the biennial Ocean Sciences Meeting that included some results from analysis with paprica. Since paprica is a relatively new method I showed the below figure to demonstrate that paprica works. The figure shows a strong correlation for four metagenomes between observed enzyme abundance and enzyme abundance predicted with paprica (from 16S rRNA gene reads extracted from the metagenome). This is similar to the approach used to validate PICRUSt and Tax4Fun.

Spearman’s correlation between predicted and observed enzyme abundance in four marine metagenomes.

The correlation looks decent, right? It’s not perfect, but most enzymes are being predicted at close to their observed abundance (excepting the green points where enzyme abundance is over-predicted because metagenome coverage is lower).

After the talk I was approached by a well known microbial ecologist who suggested that I compare these correlations to correlations with a random collection of enzymes. His concern was that because many enzymes (or genes, or metabolic pathways) are widely shared across genomes any random collection of genomes looks sort of like a metagenome. I gave this a shot and here are the results for one of the metagenomes used in the figure above.

Correlation between predicted and observed (red) and random and observed (black) enzyme abundances.

Uh oh. The correlation is better for predicted than random enzyme abundance, but rho = 0.7 is a really good correlation for the random dataset! If you think about it however, this makes sense. For this test I generated the random dataset by randomly selecting genomes from the paprica database until the total number of enzymes equaled the number predicted for the metagenome. Because there are only 2,468 genomes in the current paprica database (fewer than the total number of completed genomes because only one genome is used for each unique 16S rRNA gene sequence) the database gets pretty well sampled during random selection. As a result rare enzymes (which are also usually rare in the metagenome) are rare in the random sample, and common enzymes (also typically common in the metagenome) are common. So random ends up looking a lot like observed.
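For anyone who wants to repeat this kind of test, here is a minimal sketch of the random selection step. It is not the code I used for the figure; in particular it assumes genomes are drawn with replacement and that the database is available as a dictionary of per-genome enzyme counts, both of which are assumptions made for illustration.

```python
import random
from collections import Counter


def random_enzyme_profile(genome_enzymes, target_total, seed=None):
    """genome_enzymes: dict of genome -> dict of EC number -> count.
    Draw genomes at random (with replacement, an assumption) until the
    summed enzyme count reaches target_total; return the summed counts."""
    if not any(sum(ecs.values()) > 0 for ecs in genome_enzymes.values()):
        raise ValueError("no genome in the database carries any enzymes")
    rng = random.Random(seed)
    genomes = sorted(genome_enzymes)
    profile = Counter()
    total = 0
    while total < target_total:
        g = rng.choice(genomes)
        profile.update(genome_enzymes[g])  # adds this genome's counts
        total += sum(genome_enzymes[g].values())
    return profile


# Toy database with made-up genomes, for illustration only.
toy_db = {"genome_a": {"1.1.1.1": 2, "2.3.4.5": 1}, "genome_b": {"1.1.1.1": 1}}
print(random_enzyme_profile(toy_db, target_total=20, seed=1))
```

With only a few thousand genomes to draw from, a profile built this way converges quickly on the database-wide enzyme frequencies, which is the effect described above.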

It was further suggested that I try and remove core enzymes for this kind of test. Here are the results for different definitions of “core”, ranging from enzymes that appear in less than 100 % of genomes (i.e. all enzymes, since no EC numbers appeared in all genomes) to those that appear in less than 1 % of genomes.

The difference between the random and predicted correlations does change as the definition of the core group of enzymes changes. Here’s the data aggregated for all four metagenomes in the form of a sad little Excel plot (error bars give standard deviation).

This suggests to me a couple of things. First, although I was initially surprised at the high correlation between a random and observed set of enzymes, I’m heartened that paprica consistently does better. There’s plenty of room for improvement (and each new build of the database does improve as additional genomes are completed – the last build added 78 new genomes, see the current development version) but the method does work. Second, that we obtain maximum “sensitivity”, defined as improvement over the random correlation, for enzymes that are present in fewer than 10 % of the genomes in that database. Above that and the correlation is inflated (but not invalidated) by common enzymes, below that we start to lose predictive power. This can be seen in the sharp drop in the predicted-random rho (Δrho: is it bad form to mix greek letters with the English version of same?) for enzymes present in less than 1 % of genomes. Because lots of interesting enzymes are not very common this is where we have to focus our future efforts. As I mentioned earlier some improvement in this area is automatic; each newly completed genome improves our resolution.
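To make the Δrho idea concrete, here is a small self-contained sketch: a Spearman correlation computed from average ranks, and Δrho calculated after dropping enzymes that are too common. The dictionary inputs and the “prevalence” argument (fraction of database genomes carrying each enzyme) are assumptions about how you might organize the data, not a description of paprica’s internals.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks.
    Undefined (division by zero) if either input is constant."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def delta_rho(observed, predicted, random_profile, prevalence,
              max_prevalence=1.0):
    """All arguments are dicts keyed by EC number. Keep enzymes found in
    fewer than max_prevalence (a fraction) of database genomes, then return
    rho(predicted, observed) - rho(random, observed)."""
    keep = [ec for ec in observed if prevalence.get(ec, 0.0) < max_prevalence]
    obs = [observed[ec] for ec in keep]
    pred = [predicted.get(ec, 0) for ec in keep]
    rand = [random_profile.get(ec, 0) for ec in keep]
    return spearman(obs, pred) - spearman(obs, rand)
```

Setting max_prevalence to 0.1 corresponds to the “fewer than 10 % of genomes” case discussed above.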

Some additional thoughts on this. There are parameters in paprica that might improve Δrho. The contents of closest estimated genomes are determined by a cutoff value – the fraction of descendant genomes a pathway or enzyme appears in. I redid the Δrho calculations for different cutoff values, ranging from 0.9 to 0.1. Surprisingly this had only a minor impact on Δrho. The reason for this is that most of the 16S reads extracted from the metagenomes placed to closest completed genomes (for which cutoff is meaningless) rather than closest estimated genomes. An additional consideration is that I did all of these calculations for enzyme predictions/observations instead of metabolic pathways. The reason for this is that predicting metabolic pathways on metagenomes is rather complicated (but doable). Pathways have the advantage of being more conserved than enzymes however, so I expect to see an improved Δrho when I get around to redoing these calculations with pathways.
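The cutoff logic can be sketched in a few lines. This is an illustration of the idea rather than the paprica code itself; I’m assuming here that a feature is kept when the fraction of descendant genomes containing it is at least the cutoff, and the exact comparison used in paprica may differ.

```python
def closest_estimated_genome(descendants, cutoff):
    """descendants: list of sets, one set of features (EC numbers or
    pathways) per descendant genome of an internal node. Returns the
    features present in at least `cutoff` (a fraction) of those genomes."""
    n = len(descendants)
    if n == 0:
        return set()
    counts = {}
    for genome in descendants:
        for feature in genome:
            counts[feature] = counts.get(feature, 0) + 1
    return {f for f, c in counts.items() if c / n >= cutoff}


# Four made-up descendant genomes; 2.3.4.5 appears in half of them.
clade = [{"1.1.1.1", "2.3.4.5"}, {"1.1.1.1"}, {"1.1.1.1", "2.3.4.5"}, {"1.1.1.1"}]
print(closest_estimated_genome(clade, 0.9))  # strict cutoff
print(closest_estimated_genome(clade, 0.5))  # permissive cutoff
```

A read that places to a terminal edge gets the contents of the completed genome directly, which is why the cutoff has no effect on those placements.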

Something else that’s bugging me a bit… metagenomes aren’t sets of randomly distributed genomes. Bacterial community structure is usually logarithmic, with a few dominant taxa and a long tail of rare taxa. The metabolic inference methods by their nature capture this distribution. A more interesting test might be to create a logarithmically distributed random population of genomes, but this adds all kinds of additional complexities. Chief among them being the need to create many random datasets with different (randomly selected) dominant taxa. That seems entirely too cumbersome for this purpose…

So to summarize…

Metabolic inference definitively outperforms random selection. This is good, but I’d like the difference (Δrho) to be larger than it is.
It is not adequate to validate a metabolic inference technique using correlation with a metagenome alone. The improvement over a randomly generated dataset should be used instead.
paprica, and probably other metabolic inference techniques, have poor predictive power for rare (i.e. very taxonomically constrained) enzymes/pathways. This shouldn’t surprise anyone.
Alternate validation techniques might be more useful than correlating with the abundance of enzymes/pathways in metagenomes. Alternatives include correlating the distance in metabolic structure between samples with distance in community structure, as we did in this paper, or correlating predictions for draft genomes. In that case it would be necessary to generate a distribution of correlation values for the draft genome against the paprica (or other method’s) database, and see where the correlation for the inferred metabolism falls in that distribution. Because the contents of a draft genome are a little more constrained than the contents of a metagenome I think I’m going to spend some time working on this approach…

Tutorial: Analysis with paprica

Jeff — Mon, 25 Jan 2016 17:42:11 +0000

Paprika. Not to be confused with paprica.

This tutorial is both a work in progress and a living document. If you see an error, or want something added, please let me know by leaving a comment.

Getting Started

I’ve been making a lot of improvements to paprica, our program for conducting metabolic inference on 16S rRNA gene sequence libraries. The following is a complete analysis example with paprica to clarify the steps described in the wiki, and to highlight some of the recent improvements to the method. I’ll continue to update this tutorial as the method evolves. This tutorial assumes that you have all the dependencies for paprica-run.sh installed and in your PATH. If you’re a Mac user you can follow the instructions here. Linux (including Windows Subsystem for Linux) users should use this script as a guide. This tutorial has been re-written for the most recent version of paprica, and does not directly apply to v0.6 or earlier.

Although not required for this tutorial, I recommend that you have R installed (probably with RStudio) for downstream analysis, and the Archaeopteryx tree viewer. Follow the appropriate instructions for your platform for R and RStudio. Archaeopteryx is a little more convoluted; after first installing Java, install Archaeopteryx as such:

## Download the jar file, note that you might need to update this link to reflect the current version.  Visit https://sites.google.com/site/cmzmasek/home/software/archaeopteryx to check.
wget http://www.phyloxml.org/download/forester/forester_1050.jar
mv forester_1050.jar archaeopteryx.jar
## Download the configuration file and rename to something that works
wget http://www.phyloxml.org/download/forester/archaeopteryx/_aptx_configuration_file.txt
mv _aptx_configuration_file.txt aptx_configuration_file

Double click on archaeopteryx.jar to start the program. Finally, this tutorial assumes that you are using the provided database of metabolic pathways and genome data included in the ref_genome_database directory. If you want to build a custom database you should follow this tutorial. All the dependencies are installed and tested? Before we start let’s get familiar with some terminology.

closest completed genome (CCG): One type of edge: the most closely taxonomically related completed genome to a query read. This term is only applicable for query reads that place to a terminal edge (branch tip) on the reference tree.

closest estimated genome (CEG): Another type of edge: the set of genes that are estimated to be found in all members of a clade originating at a branch point in the reference tree. This term is only applicable to query reads that place to an internal edge on the reference tree. The CEG is the point of placement.

community structure: The taxonomic structure of a bacterial assemblage.

edge: An edge is a point of placement on a reference tree. Think of it as a branch of the reference tree. Edges take the place of OTUs in this workflow, and are ultimately far more interesting and informative than OTUs. Refer to the pplacer documentation for more.

unique read: A read from a dataset that has been denoised using (e.g.) dada2.

metabolic structure: The abundance of different metabolic pathways within a bacterial assemblage.

reference tree: This is the tree of representative 16S rRNA gene sequences from all completed Bacterial genomes in Genbank. The topology of the tree defines what pathways are predicted for internal branch points.

Overview

The basic steps that paprica takes during analysis are:

Identify the reads that belong to the specified domain (bacteria, archaea, eukarya).
Place the reads on a 16S + 23S rRNA gene tree comprised of all completed genomes in RefSeq genomes (Archaea) or representatives from all phyla present in RefSeq genomes (Bacteria). For bacteria, reads that place to a phylum (i.e. not an internal node) are then placed on a 16S + 23S rRNA gene tree comprised of all completed genomes in RefSeq genomes belonging to that phylum. The domain Eukarya follows a similar procedure as for Bacteria, except that an 18S rRNA gene tree is used and division is used rather than phylum. The 18S rRNA gene sequences come from the PR2 database rather than RefSeq.
For Bacteria and Archaea only, map the enzymes, pathways, and genome parameters present at each point of placement to the query reads.
Use the points of placement to provide a consensus taxonomy for each query read.

Testing the Installation

You can test your installation of paprica and its dependencies using the provided test.bacteria.fasta or test.archaea.fasta files. For test.bacteria.fasta, from the paprica directory:

./paprica-run.sh test bacteria

The first argument specifies the name of the input fasta file (test.bacteria.fasta) without the extension. The second argument specifies which domain you are analyzing. Executing the command produces a variety of output files in the paprica directory:

ls test*
temp.bacteria
test.archaea.fasta
test.archaeal16S.reads.txt
test.bacteria.clean.fasta
test.bacteria.clean.unique.align.sto
test.bacteria.clean.unique.count
test.bacteria.clean.unique.fasta
test.bacteria.combined_16S.bacteria.tax.placements.csv
test.bacteria.ec.csv
test.bacteria.edge_data.csv
test.bacteria.fasta
test.bacteria.pathways.csv
test.bacteria.sample_data.txt
test.bacteria.sum_ec.csv
test.bacteria.sum_pathways.csv
test.bacteria.unique_seqs.csv
test.bacterial16S.reads.txt
test.eukarya.fasta
test.eukaryote18S.reads.txt
test.fasta
test.unique.count
test.unique.fasta

Each sample fasta file that you run will produce similar output. These files are described in detail in the Wiki here, with the following being particularly useful to you:

test.bacteria: Because multiple jplace and other intermediate files are created for the domains Bacteria and Eukarya the number of files quickly gets out of control. Some of these files are useful for downstream analysis and troubleshooting, however, so they are retained in a directory with the sample name.

test.bacteria.combined_16S.bacteria.tax.placements.csv: The *placements.csv file represents a summary of the placement data for each unique query read and is the result of parsing the jplace file(s) produced by epa-ng. For Bacteria and Eukarya, which have multiple reference trees, this file contains the phylogenetic placement results for all the reference trees for a given sample.

test.bacteria.bacteria.edge_data.csv: This is a csv format file containing data on edge location in the reference tree that received a placement, such as the number of reads that placed, predicted 16S rRNA gene copies, number of reads placed normalized to 16S rRNA gene copies, GC content, etc. This file describes the taxonomic structure of your sample.

test.bacteria.unique_seqs.csv: This is a csv file that contains the abundance and normalized abundance of unique sequences. Each sequence is identified by a unique hash (to allow tallying across multiple samples) and the name of a representative read is also provided.

test.bacteria.bacteria.sum_pathways.csv: This is a csv file of all the metabolic pathways inferred for test.bacteria.fasta, by edge. All possible metabolic pathways are listed (meaning all metabolic pathways that are represented in the database), the number attributed to each edge is given in the column for that edge.

test.bacteria.bacteria.sum_ec.csv: This is a csv file of all the enzymes with EC numbers inferred for test.bacteria.fasta, by edge. The file is structured the same as test.bacteria.bacteria.sum_pathways.csv.

test.bacteria.bacteria.sample_data.txt: This file describes some basic information for the sample, such as the database version that was used to make the metabolic inference, the confidence score, total reads used, etc.

test.bacteria.bacteria.sum_pathways.csv: This csv format file describes the metabolic structure of the sample, i.e. pathway abundance across all edges.

If you want to run an analysis with archaeal 16S rRNA or eukaryotic 18S rRNA gene reads you can test the same way. Note that there is no metabolic inference for the domain eukarya, but the reads are placed on an extensive set of reference trees that are useful for classification and phylogenetic analysis.

./paprica-run.sh test archaea
./paprica-run.sh test eukarya

Conducting an Analysis – read QC

Okay, that was all well and good for the test.fasta file, which has placements only for a single edge and is not particularly exciting. Let’s try something a bit more advanced that mimics what you would do for an actual analysis. First we need to get some data. In this case we’ll use a set of samples from a recent publication, Webb et al. 2019. You can acquire these data from the NCBI SRA site using prefetch. First, create a file titled SRA_select_Run.txt, then populate the file with these accession numbers:

SRX4496910
SRX4496911
SRX4496912
SRX4496913
SRX4496914
SRX4496915
SRX4496916
SRX4496917
SRX4496918
SRX4496919
SRX4496920
SRX4496921
SRX4496922
SRX4496923
SRX4496924
SRX4496925
SRX4496926
SRX4496927
SRX4496928
SRX4496929
SRX4496930
SRX4496931
SRX4496932
SRX4496933
SRX4496934
SRX4496935
SRX4496936
SRX4496937
SRX4496938
SRX4496939
SRX4496940
SRX4496941
SRX4496942
SRX4496943
SRX4496944
SRX4496945
SRX4496946
SRX4496947
SRX4496948
SRX4496949
SRX4496950
SRX4496951
SRX4496952
SRX4496953
SRX4496954

Then execute the following command (if you don’t have it already, install GNU parallel with your package manager of choice):

parallel fastq-dump --split-files --skip-technical {} < SRA_select_Run.txt
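The accession numbers listed above happen to be consecutive (SRX4496910 through SRX4496954), so if you’d rather not paste them by hand a short Python sketch can generate SRA_select_Run.txt for you. The function names are mine; only the file name and the accession range come from this tutorial.

```python
def seagrass_accessions(first=4496910, last=4496954, prefix="SRX"):
    """Return the consecutive accession numbers used in this tutorial."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


def write_accession_file(path="SRA_select_Run.txt"):
    """Write one accession per line, the layout the parallel command reads."""
    with open(path, "w") as handle:
        handle.write("\n".join(seagrass_accessions()) + "\n")


# Run this in your working directory before the parallel command:
# write_accession_file()
```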

Now you’ll want to QC, denoise, and merge the reads. I’m not going to cover that here, but strongly recommend using dada2. You can use this R script, which is a modification of the tutorial on the dada2 site. Merged reads should be inflated to redundant fasta files before analysis with paprica. You can use our deunique_dada2.py script for that.

Conducting an Analysis – running paprica

Previously you executed paprica on just a test file (test.fasta). Here we have a larger number of samples, and we need to construct a loop to run paprica on all of them. First, copy the paprica/paprica-run.sh file into your working directory:

cp ~/paprica/paprica-run.sh paprica-run.sh

Paprica is pretty fast, and the key parts are already parallelized, so it doesn’t make sense to run the samples in parallel. We can loop across multiple input files like this, specifying the file to run without the “.fasta” extension:

for f in *exp.fasta;do NAME=$(basename $f .fasta); ./paprica-run.sh $NAME bacteria; done
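If you prefer to drive the loop from Python (handy if you want to log failures or skip samples), the following sketch does the same thing as the bash one-liner. It assumes paprica-run.sh sits in the working directory as described above; the dry_run switch is my own addition so you can inspect the commands before anything executes.

```python
import subprocess
from pathlib import Path


def sample_names(directory=".", pattern="*exp.fasta"):
    """Names of matching fasta files with the .fasta extension removed,
    mirroring $(basename $f .fasta) in the bash loop."""
    return sorted(p.stem for p in Path(directory).glob(pattern))


def run_all(directory=".", domain="bacteria", dry_run=True):
    """Build one paprica-run.sh command per sample and, unless dry_run is
    set, run them one after another (serially, like the bash loop)."""
    commands = [["./paprica-run.sh", name, domain]
                for name in sample_names(directory)]
    if not dry_run:
        for cmd in commands:
            # check=True stops the loop if paprica exits with an error.
            subprocess.run(cmd, cwd=directory, check=True)
    return commands


# Inspect first, then run for real:
# print(run_all())
# run_all(dry_run=False)
```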

At this point you have individual analysis files for all your samples. These can be quite useful, but most useful are classic abundance tables for edges, unique reads, enzymes, and pathways. You can create these using the paprica-combine_results.py script. This used to be a separate utility but is now a core part of paprica. That means the script should reside in your paprica directory, and you call it from your working directory:

paprica-combine_results.py -domain bacteria -o 2017.07.03_seagrass

This produces several files. Note that the prefix in the file names is set by the -o flag in the command above, and reflects the nature of these example reads:

2017.07.03_seagrass_bacteria.taxon_map.txt: This file maps edge numbers to the name of the lowest consensus taxonomy (for CEGs) or strain name (for CCGs).

2017.07.03_seagrass_bacteria.seq_edge_map.txt: This file maps each unique sequence to the edge it most frequently places to across samples. The vast majority of reads will place to only a single edge; however, the more samples you have and the more abundant a read is, the more likely it is that it will place to more than one edge.

2017.07.03_seagrass_bacteria.edge_data.csv: These are the mean genome parameters for each sample. Lots of good stuff in here, see the column labels.

2017.07.03_seagrass_bacteria.edge_tally.csv: Edge abundance for each sample (corrected for 16S rRNA gene copy). This is your community structure, and is equivalent to an OTU table (but much better!).

2017.07.03_seagrass_bacteria.unique_tally.csv: The abundance and normalized abundance of unique sequences.

2017.07.03_seagrass_bacteria.ec_tally.csv: Enzyme abundance for each sample.

2017.07.03_seagrass_bacteria.path_tally.csv: Metabolic pathway abundance for each sample.

To get familiar with some basic operations on the output files, including exploring the data with heatmaps and correspondence analysis, you can continue with this tutorial.

Tutorial: Installing paprica on Mac OSX

Jeff — Wed, 20 Jan 2016 14:55:38 +0000

The following is a paprica installation tutorial for novice users on Mac OSX (installation on Linux is quite a bit simpler). If you’re comfortable editing your PATH and installing things using bash you probably don’t need to follow this tutorial, just get the dependencies as indicated in the install instructions and linux_install.sh script. If command line operations stress you out, and you haven’t dealt with a lot of weird bioinformatics program installs, use this tutorial.

Please note that this tutorial is a work in progress. If you notice errors, inconsistencies, or omissions please leave a comment and I’ll be sure to correct them. This tutorial has been updated for the most recent version of paprica, and will not work for v0.6 or earlier.

** IMPORTANT ** It is generally considered very poor practice to install anything in the root directory. You might think, “but I’m the only user, so this makes more sense” or “but everyone in the lab wants program X, so I should install as root.” Don’t do it. Install to your home directory. It will add years to your life.

This tutorial assumes you’ve followed this advice, and that you are installing all the dependencies in your home directory.

Install Python and Python packages

paprica is 90 % an elaborate set of wrapper scripts for several core programs written by other groups. The scripts that execute the pipeline are bash scripts, the scripts that do the actual work are Python. Therefore you need Python up and running on your system. If you already have a mainstream v3.0 Python distro going just make sure that the modules listed below are installed (e.g., with conda install [package] and not pip3). Note that Python 3 must be callable on your system as “python3” which should be the default.

Install some necessary Python modules, assuming you don’t already have them:

pip3 install numpy
pip3 install biopython
pip3 install joblib
pip3 install pandas
pip3 install seqmagick
pip3 install termcolor

In case you have conflicts with other Python installations, or some other mysterious problems, it’s a good idea to test things out at this point. Open a shell, type “python3” and:

import numpy
import Bio
import joblib
import pandas
import termcolor

If you get any error messages something somewhere is wrong. Burn some incense and try again. If that doesn’t work try holy water.
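The same check can be scripted so that you get a list of what’s missing instead of stopping at the first failed import. This is just a convenience sketch using the standard library; the module names are the import names used above (note that biopython imports as Bio).

```python
import importlib.util


def missing_modules(names=("numpy", "Bio", "joblib", "pandas", "termcolor")):
    """Return the top-level module names that Python cannot find.
    find_spec locates a module without actually importing it."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_modules()
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All modules found.")
```

Run it with python3 so that it tests the same interpreter paprica will use.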

Seqmagick is a standalone program, not a module, so check the installation by typing:

seqmagick

You should get a sensible error that is clearly seqmagick yelling at you and not your computer trying to find seqmagick.

Install Homebrew and wget

Older versions of paprica needed the programs pplacer and gappa, which had dependencies that could only be acquired for OSX through a package manager such as Homebrew. These are no longer needed for paprica but I’ve left the Homebrew step in here because if you’re doing anything sciency with your computer you probably want a package manager, and I find wget to be a much more useful file fetching utility than curl.

To download Homebrew (assuming you don’t already have it) type:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Follow the on-screen instructions.

Now install wget.

brew install wget

Install Infernal, gappa, and epa-ng

Assuming all that went okay go ahead and download the software you need to execute just the paprica-run.sh portion of paprica. First, the excellent aligner Infernal. From your home directory:

wget http://eddylab.org/infernal/infernal-1.1.1-macosx-intel.tar.gz
tar -xzvf infernal-1.1.1-macosx-intel.tar.gz
mv infernal-1.1.1-macosx-intel infernal

Then gappa:

To install gappa you need make and cmake, and a fairly up-to-date C++11 compiler.

brew install make
brew install cmake

If you try to do the compilation with AppleClang on MacOSX it will probably fail. You need OpenMP, but it is a known issue that AppleClang on MacOSX does not work well with OpenMP. Here we provide two alternatives based on a discussion on gappa’s page, and you can read about it here.

Install gappa via conda, instead of compiling on your own (it does not need OpenMP): https://anaconda.org/bioconda/gappa
Instead of AppleClang, use a “proper” clang:

brew install llvm libomp

You’ll need to set custom paths so that the new clang is used:

#Example on how to do this:
export PATH="$(brew --prefix llvm)/bin:$PATH"
export COMPILER=/usr/local/opt/llvm/bin/clang++
export CFLAGS="-I /usr/local/include -I/usr/local/opt/llvm/include"
export CXXFLAGS="-I /usr/local/include -I/usr/local/opt/llvm/include"
export LDFLAGS="-L /usr/local/lib -L/usr/local/opt/llvm/lib"
export CXX=${COMPILER}

#Not needed for all MacOS versions
#If you get this error: "ld: unknown option: -platform_version"
#You'll need to add:
export CXXFLAGS="${CXXFLAGS} -mlinker-version=450"
export LDFLAGS="${LDFLAGS} -mlinker-version=450"

And now you should be ready to install gappa:

git clone --recursive https://github.com/lczech/gappa.git
cd gappa
make
cd ~

And finally, epa-ng:

brew install brewsci/bio/epa-ng

Add dependencies to PATH

Now comes the tricky bit, you need to add the locations of the executables for these programs to your PATH variable. This is a pretty important basic computing skill to master. Try not to screw it up. It isn’t hard to undo screw-ups, but it will freak you out because bash will suddenly be unable to find programs that it could find before. Before you continue please read the excellent summary of shell startup scripts as they pertain to OSX here:

http://hayne.net/MacDev/Notes/unixFAQ.html#shellStartup

This tutorial attempts to provide a broad solution to shell startup scripts by sourcing .profile and .bash_profile in .bashrc. I recommend you then only modify .bashrc, though this is not strictly necessary.

## Open .bashrc for editing.

nano .bashrc

At the top of the file type:

source .bash_profile
source .profile

Now navigate to the end of the file and paste the following, modifying as necessary (note: there are lots of syntactic variations for adding a location to PATH; the commands below are a little redundant but clear and easy to modify):

export PATH=/Users/your-user-name/infernal/binaries:${PATH}
export PATH=/Users/your-user-name/infernal/easel:${PATH}
export PATH=/Users/your-user-name/pplacer:${PATH}
export PATH=/Users/your-user-name/epa-ng/bin:${PATH}
export PATH=/Users/your-user-name/paprica:${PATH}
export PATH=/Users/your-user-name/gappa/bin:${PATH}

Don’t be the guy or gal who types your-user-name. Replace with your actual user name. Hit ctrl-o to write out the file, enter to save, and ctrl-x to exit nano.
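One of those syntactic variations, as a rough sketch: because every export line above prepends unconditionally, each re-source of .bashrc adds the same directories to PATH again. The helper below (path_prepend is a hypothetical name, not something paprica provides) only adds a directory if it exists and is not already listed. It uses ${HOME} in place of /Users/your-user-name, on the assumption that the programs were installed in your home directory as in this tutorial.

```shell
# Hypothetical helper: prepend a directory to PATH only if it exists
# and is not already on PATH.
path_prepend() {
    dir="$1"
    # Skip directories that do not exist, e.g. a mistyped user name.
    if [ ! -d "$dir" ]; then
        echo "skipping missing directory: $dir" >&2
        return 0
    fi
    case ":${PATH}:" in
        *":${dir}:"*) ;;                 # already on PATH, do nothing
        *) PATH="${dir}:${PATH}" ;;
    esac
    export PATH
}

path_prepend "${HOME}/infernal/binaries"
path_prepend "${HOME}/infernal/easel"
path_prepend "${HOME}/paprica"
path_prepend "${HOME}/gappa/bin"
```

The warning for a missing directory goes to stderr, so a typo in a path shows up the next time you open a terminal instead of failing silently.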

Re-source .bashrc by typing:

source .bashrc

Confirm that you can execute the following programs by navigating to your home directory and executing each of the following commands:

cmalign
esl-alimerge
gappa
epa-ng

You should get an error message that is clearly from the program, not a bash error like “command not found”.
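If you would rather check all four at once, here is a small sketch that uses the shell builtin command -v to ask whether each program can be found on PATH. The function name check_tools is made up for this example.

```shell
# Report which of the given programs the shell can find on PATH.
# Returns non-zero if any of them are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools cmalign esl-alimerge gappa epa-ng || echo "Fix PATH in .bashrc, then re-source it."
```

This only tells you that the executable was found, not that it runs correctly, so it complements rather than replaces running each program by hand.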

Get paprica

Okay, now you are ready to get paprica and do some analysis! You can clone the latest version of the repository:

git clone https://github.com/bowmanjeffs/paprica.git

Now make paprica-run.sh and the Python scripts executable.

cd paprica 
chmod a+x paprica-run.sh 
chmod a+x *.py

At this point you should be ready to rock. Take a deep breath and type:

./paprica-run.sh test bacteria

This analyzes the file test.fasta against the bacteria database. You should see a lot of output flash by on the screen, and you should see a number of new files in the directory with the prefix “test.” Check out the paprica analysis tutorial and manual for more info on these files.

To run your own analysis, say on amazing_sample.fasta against the bacteria database, simply type:

./paprica-run.sh amazing_sample bacteria

Please, please, please, read the manual for further details. Remember that the fasta file you input should contain only reads that are properly QC’d (i.e. low quality ends and adapters and barcodes and such trimmed away) and denoised (e.g., with dada2).
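If you have more than one sample, a loop saves some typing. The sketch below assumes your fasta files sit in the paprica directory alongside paprica-run.sh and end in .fasta; run_all_samples and the DRY_RUN variable are hypothetical names for this example, not paprica options. With DRY_RUN=1 it only prints the commands it would run.

```shell
# Hypothetical wrapper: run paprica-run.sh on every .fasta file in the
# current directory. Set DRY_RUN=1 to print the commands instead.
run_all_samples() {
    database="${1:-bacteria}"
    for fasta in *.fasta; do
        [ -e "$fasta" ] || continue            # no .fasta files: the glob stays literal
        sample="$(basename "$fasta" .fasta)"   # paprica-run.sh takes the name without .fasta
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "./paprica-run.sh $sample $database"
        else
            ./paprica-run.sh "$sample" "$database"
        fi
    done
}

# Preview the commands for the fasta files in the current directory.
DRY_RUN=1 run_all_samples bacteria
```

Once the preview looks right, call run_all_samples bacteria without DRY_RUN to actually run the analyses one after another.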