Tutorial: Analysis with paprica

This tutorial is both a work in progress and a living document.  If you see an error, or want something added, please let me know by leaving a comment.

This tutorial currently applies to v0.3 only! If you have version 0.2x I recommend updating.

One final note… I’m not updating all the graphics and text as I develop this tutorial with evolving versions of the paprica database.  Don’t be alarmed if you’re following along and your edge numbers don’t match the edge numbers given in the text or graphics.

Getting Started

I’ve been making a lot of improvements to paprica, our program for conducting metabolic inference on 16S rRNA gene sequence libraries.  The following is a complete analysis example with paprica to clarify the steps described in the manual, and to highlight some of the recent improvements to the method.  I’ll continue to update this tutorial as the method evolves.  This tutorial assumes that you have all the dependencies for paprica-run.sh installed and in your PATH.  If you’re a Mac user you can follow the instructions here.  If you’re a Linux user (including Windows users running Linux in a VirtualBox) installation is a bit simpler; just follow the instructions in the manual.  The current list of dependencies is:

  1.  Infernal (including Easel)
  2.  pplacer (including Guppy)
  3. Python version 2.7
  4. Seqmagick

It also assumes that you have the following Python modules installed:

  1.  Pandas
  2.  Joblib
  3.  Biopython

It further assumes that you have R installed (probably with RStudio), and the Archaeopteryx tree viewer.  R and Archaeopteryx aren’t required for paprica to run, but they are very useful for analyzing the output.  Follow the appropriate instructions for your platform for R and RStudio.  Archaeopteryx is a little more convoluted; after first installing Java, install Archaeopteryx as follows:

## Download the jar file
wget https://dl.dropboxusercontent.com/u/14456265/forester/download/forester_1045.jar
mv forester_1045.jar archaeopteryx.jar
## Download the configuration file and rename to something that works
wget https://dl.dropboxusercontent.com/u/14456265/forester/archaeopteryx/configuration_files/_aptx_configuration_file.txt
mv _aptx_configuration_file.txt aptx_configuration_file

Double click on archaeopteryx.jar to start the program.  Finally, this tutorial assumes that you are using the provided database of metabolic pathways and genome data included in the ref_genome_database directory.  If you want to build a custom database you should follow this tutorial.  All the dependencies are installed and tested?  Before we start let’s get familiar with some terminology.

closest completed genome (CCG): One type of edge: the completed genome most closely related, taxonomically, to a query read.  This term is only applicable to query reads that place to a terminal edge (branch tip) on the reference tree.

closest estimated genome (CEG): Another type of edge: the set of genes estimated to be found in all members of a clade originating at a branch point in the reference tree.  This term is only applicable to query reads that place to an internal edge on the reference tree; the CEG is the edge at the point of placement.

community structure: The taxonomic structure of a bacterial assemblage.

edge: An edge is a point of placement on a reference tree.  Think of it as a branch of the reference tree.  Edges take the place of OTUs in this workflow, and are ultimately far more interesting and informative than OTUs.  Refer to the pplacer documentation for more.

metabolic structure: The abundance of different metabolic pathways within a bacterial assemblage.

reference tree: This is the tree of representative 16S rRNA gene sequences from all completed bacterial genomes in GenBank.  The topology of the tree defines what pathways are predicted for internal branch points.

Testing the Installation

You can test your installation of paprica and its dependencies using the provided test.bacteria.fasta or test.archaea.fasta files.  For test.bacteria.fasta, from the paprica directory:

./paprica-run.sh test.bacteria bacteria

The first argument specifies the name of the input fasta file (test.bacteria.fasta) without the extension.  The second argument specifies which domain you are analyzing.  Executing the command produces a variety of output files in the paprica directory:

ls test*
test.bacteria.clean.align.sto
test.bacteria.clean.fasta
test.bacteria.combined_16S.bacteria.tax.clean.align.csv
test.bacteria.combined_16S.bacteria.tax.clean.align.db
test.bacteria.combined_16S.bacteria.tax.clean.align.fasta
test.bacteria.combined_16S.bacteria.tax.clean.align.jplace
test.bacteria.combined_16S.bacteria.tax.clean.align.phyloxml
test.bacteria.combined_16S.bacteria.tax.clean.align.sto
test.bacteria.bacteria.edge_data.csv
test.bacteria.bacteria.pathways.csv
test.bacteria.bacteria.ec.csv
test.bacteria.bacteria.sum_ec.csv
test.bacteria.bacteria.sample_data.txt
test.bacteria.bacteria.sum_pathways.csv
test.bacteria.bacteria.unique_seqs.csv

Each sample fasta file that you run will produce similar output, with the following being particularly useful to you:

test.bacteria.combined_16S.bacteria.tax.clean.align.jplace: This is a file produced by pplacer that contains the placement information for your sample.  You can do all kinds of interesting things with sets of jplace files using Guppy.  Refer to the Guppy documentation for more details.

test.bacteria.combined_16S.bacteria.tax.clean.align.phyloxml: This is a “fat” style tree showing the placements of your query on the reference tree.  You can view this tree using Archaeopteryx.

test.bacteria.bacteria.edge_data.csv: This is a csv format file containing data for each edge in the reference tree that received a placement, such as the number of reads that placed, predicted 16S rRNA gene copies, number of reads placed normalized to 16S rRNA gene copies, GC content, etc.  This file describes the taxonomic structure of your sample.

test.bacteria.bacteria.unique_seqs.csv: This is a csv file that contains the abundance and normalized abundance of unique sequences.  Each sequence is identified by a unique hash (to allow tallying across multiple samples), and the name of a representative read is also provided.
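The hashing idea is simple enough to sketch.  Note that paprica’s actual hash function isn’t shown here, so treat this purely as an illustration of why hashing lets identical sequences tally together across samples:

```python
import hashlib

def seq_hash(seq):
    """Illustrative only: hash a sequence so that identical reads
    get the same identifier in every sample."""
    return hashlib.md5(seq.upper().encode()).hexdigest()

print(seq_hash('ACGT') == seq_hash('acgt'))  # -> True: same sequence, same hash
```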

test.bacteria.bacteria.pathways.csv: This is a csv file of all the metabolic pathways inferred for test.bacteria.fasta, by edge.  All metabolic pathways represented in the database are listed; the number attributed to each edge is given in the column for that edge.

test.bacteria.bacteria.ec.csv: This is a csv file of all the enzymes with EC numbers inferred for test.bacteria.fasta, by edge.  The file is structured the same way as test.bacteria.bacteria.pathways.csv.

test.bacteria.bacteria.sample_data.txt: This file describes some basic information for the sample, such as the database version that was used to make the metabolic inference, the confidence score, total reads used, etc.

test.bacteria.bacteria.sum_pathways.csv: This csv format file describes the metabolic structure of the sample, i.e. pathway abundance across all edges.

If you want to run an analysis with archaeal 16S rRNA gene reads you can test the archaeal side of the database the same way:

./paprica-run.sh test.archaea archaea

Conducting an Analysis – read QC

Okay, that was all well and good for the test.bacteria.fasta file, which has placements for only a single edge and is not particularly exciting.  Let’s try something a bit more advanced that mimics what you would do for an actual analysis.  Create a directory for the new analysis in your home directory and copy the paprica-run.sh script to it.  You should always set up a new run this way, as it keeps your analyses organized and provides a record of how you’ve used paprica-run.sh in each case:

cd ~
mkdir my_analysis
cp paprica/paprica-run.sh my_analysis/

Now add some fastq files for this tutorial.  For your own analysis you will most likely have paired-end Illumina reads, which you’ll need to assemble.  I recommend using PEAR.  The .assembled.fastq file output by PEAR can be input directly into the following workflow.  To make things easy we provide two fastq files for this tutorial.  These files represent a summer and a winter surface water sample from Luria et al. 2014 and Bowman and Ducklow, 2015.  I recommend wget for downloads of this sort; if you don’t have it, and don’t want to install it for some reason, use curl.

cd my_analysis
wget http://www.polarmicrobes.org/extras/summer.fastq
wget http://www.polarmicrobes.org/extras/winter.fastq

Next we need to perform some quality control (QC) on these reads.  We’ll use the paprica dependency seqmagick for this, which is both fast and flexible.

### NOTE: This seqmagick command doesn't work with Biopython 1.67 or 1.68.  Use 1.66 until fixed.
### Roll back with conda install Biopython=1.66
## Trim for quality.
seqmagick quality-filter summer.fastq summer.fasta --min-mean-quality 30 --min-length 100 --max-ambiguous 0
seqmagick quality-filter winter.fastq winter.fasta --min-mean-quality 30 --min-length 100 --max-ambiguous 0

It used to be that at this point in the analysis you needed to remove undesirable taxa (chloroplasts masquerading as cyanobacteria, for example).  Paprica now has some built-in capabilities for dealing with this, so we won’t worry about it now.

At this point we have a reasonably well QC’d set of reads.  Depending on your dataset some additional QC may be necessary.  In particular, if paprica exits with a barrage of errors, and you notice that the first error is cmalign complaining about the size of the DP matrix, you may have erroneous reads resulting from amplification or sequencing errors that were not eliminated by the QC steps above.  These reads can be tricky to track down but must be removed.  One way to do this is to conduct an initial “test” alignment using the software Mothur, removing reads that don’t align to the region defined by the primers used, de-gapping the sequences, and feeding these reads back into paprica.  For example, for the file summer.fasta we might perform these additional QC steps to remove reads that fall outside the primer-defined region of the 50,000-column Silva alignment, or that have an excessive number of homopolymers:

mothur "#align.seqs(fasta=summer.fasta, template=silva.nr_v123.align)"
mothur "#screen.seqs(fasta=summer.align, start=31123, end=34464, maxhomop=5)"
mothur "#degap.seqs(fasta=summer.good.fasta)"

This would produce a file named something like summer.good.ng.fasta, which you would feed into paprica (you could first rename it something sensible like summer.qc.fasta).  These additional QC steps were not necessary for the summer.fasta and winter.fasta files, however, so let’s proceed without further modification.

Conducting an Analysis – run parameters

During analysis with paprica run time may be a concern if you have many query files to run, and/or they are particularly large.  The rate-limiting step in paprica is pplacer.  We can speed pplacer up by telling paprica-place_it.py to split the query fasta into several pieces that pplacer will run in parallel.  Be careful of memory usage!  pplacer creates two threads automatically when it runs, and each thread uses about 4 GB of memory with the bacterial reference tree and alignment (much less for archaea).  So if your system has only 2 CPUs and 8 GB of memory don’t use this option!  If your system has 32 GB of RAM I’d recommend 3 splits, so that you don’t max things out.
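As a sanity check before you edit -splits, the arithmetic above is easy to encode.  This helper is not part of paprica; it just restates the rule of thumb (two threads per pplacer run, roughly 4 GB per thread with the bacterial reference tree):

```python
# Not part of paprica: a back-of-the-envelope estimate of a safe -splits
# value, using the figures above (two threads per pplacer run, ~4 GB per
# thread with the bacterial reference tree).
GB_PER_SPLIT = 2 * 4

def max_safe_splits(ram_gb, headroom_gb=8):
    """Largest -splits value that still leaves headroom_gb of memory free."""
    return max(1, (ram_gb - headroom_gb) // GB_PER_SPLIT)

print(max_safe_splits(32))  # -> 3, matching the recommendation above
```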

While we’re modifying run parameters let’s make one additional change.  Very likely your QC’d fasta files will contain different numbers of reads.  We can check this with:

grep -c '>' *fasta
summer.fasta:5270
winter.fasta:8676

It’s generally a good idea to subsample your reads to the size of the smallest library so that you are viewing diversity evenly across samples (see discussion here).  You can get paprica to do this for you by specifying the number of reads paprica-place_it.py should use.  When this option is used paprica will make use of Python’s random.sample function, which will yell at you if you try to collect a sample the same size as the population.  To avoid this specify a value that is one smaller than the smallest library.
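Under the hood the subsampling amounts to a call to random.sample on the read ids.  The read names below are invented, and paprica-place_it.py does all of this for you; this is just to show what the -n value controls:

```python
import random

# Invented read ids standing in for the winter library (8676 reads).
reads = ['read_%i' % i for i in range(8676)]

# Subsample to one fewer than the smallest library, as suggested above.
subsample = random.sample(reads, 5269)

print(len(subsample))  # -> 5269; sampling is without replacement
```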

To specify the number of splits and the number of reads edit the paprica-place_it.py flags in the paprica-run.sh script:

#Open paprica script

nano paprica-run.sh 

## This is the default line in paprica-run.sh
#paprica-place_it.py -ref_dir ref_genome_database -query $query -ref combined_16S.$domain.tax -splits 1 -domain $domain

## This is the modified line in paprica-run.sh
paprica-place_it.py -ref_dir ref_genome_database -query $query -ref combined_16S.$domain.tax -splits 3 -n 5269 -domain $domain

This will cause paprica to subsample the query file (by random selection) to 5269 reads, and split the subsampled file into three query files that will be run in parallel.  The parallel run is invisible to you; the output should be identical to a run with no splits (-splits 1).  If you use subsampling you’ll also need to change the paprica-tally_pathways.py line, as the input file name will be slightly different.

There are a couple of other options we can employ in paprica-tally_pathways.py.  I mentioned before that paprica can deal with undesirable taxa; to exclude a range of edge numbers from your analysis use the -omit flag (you identify the range to omit by inspecting your trees and the genome_data.final.csv file in ref_genome_database).  If you think that pplacer is misbehaving with placements to the reference tree, which can happen with edges and query reads that happen to be compositionally but not phylogenetically similar, use the -override flag (see examples further on in this tutorial).  Here we just make the name change for subsampling and to get the unique sequences file:

## This is the default line in paprica-run.sh
#paprica-tally_pathways.py -ref_dir ref_genome_database -i $query.combined_16S.$domain.tax.clean.align.csv -o $query.$domain -cutoff 0.5 -domain $domain -unique $query.$domain.unique.seqs.csv

## This is the modified line in paprica-run.sh, reflecting name change
## for subsampling. 
paprica-tally_pathways.py -ref_dir ref_genome_database -i $query.sub.combined_16S.$domain.tax.clean.align.csv -o $query.$domain -cutoff 0.5 -domain $domain -unique $query.sub.$domain.unique.seqs.csv

Here we are only analyzing two samples, so running them manually isn’t too much of a pain. But you might have tens or hundreds of samples, and need a way to automate that. We do this with a simple loop. I recommend generating a file with the prefixes of all your query files and using that in the loop. For example the file samples.txt might have:

summer
winter

You can create your own bash script to run your analysis and insert a loop to run paprica on all your samples. If you are new to bash, you could read more about it here.

# Create the script 
nano my_analysis.sh

# Add your bash shell interpreter and the loop you wish to run

#!/bin/bash
while read f;do
   ./paprica-run.sh $f bacteria
done < samples.txt

Now you can make your script executable:

chmod +x my_analysis.sh

#Run it
./my_analysis.sh

Note that we don’t run them in parallel using, say, GNU parallel, because Infernal is intrinsically parallelized, and we already forced pplacer to run in parallel using -splits.  Obviously you could expand the loop to include all the analysis commands, leaving you free to kick off for the day.

Once you’ve executed the loop you’ll see all the normal paprica output, for both samples.  It’s useful to concatenate some of this information for downstream analysis.   The provided utility combine_edge_results.py can do this for you.  Copy it to your working directory:

cp ~/paprica/utilities/combine_edge_results.py combine_edge_results.py

This script will automatically aggregate everything with the specified suffixes.  The suffixes can be any suffix that uniquely identifies your analysis files, or a group of analysis files.  You need to specify a prefix for the output files (in this case my_analysis).

python combine_edge_results.py -edge_in bacteria.edge_data.csv -path_in bacteria.sum_pathways.csv -ec_in bacteria.sum_ec.csv -unique_in bacteria.unique_seqs.csv -o my_analysis

This produces several files:

my_analysis.edge_data.csv: These are the mean genome parameters for each sample.  Lots of good stuff in here, see the column labels.

my_analysis.edge_tally.csv: Edge abundance for each sample (corrected for 16S rRNA gene copy).  This is your community structure, and is equivalent to an OTU table (but much better!).

my_analysis.unique_tally.csv: The abundance and normalized abundance of unique sequences.

my_analysis.ec_tally.csv: Enzyme abundance for each sample.

my_analysis.path_tally.csv: Metabolic pathway abundance for each sample.

Now let’s take a closer look at the taxonomic structure of these samples as described by paprica.  We can do this using the Archaeopteryx tree viewer.  Fire up Archaeopteryx and load in the file summer.sub.combined_16S.bacteria.tax.clean.align.phyloxml.  You should see a tree that looks something like this:

This is a “fat” format tree that was created by the Guppy component of pplacer during paprica-place_it.py.  Each edge on the tree has been fattened in accordance with the number of reads that placed there.  The terminal edges (the closest completed genomes, or CCG) are conveniently labeled with the accession number and some kind of reasonable taxonomic name that will in most cases tell you genus, species, and strain.  In the trees produced by pplacer the edge number – the critical identifier connecting read placement with location in the tree – is located in the tree file in the place where confidence values usually go.  If you toggle the “confidence value” switch in Archaeopteryx you get labels for the edges.  Toggling that switch and zooming in on the edge with the most reads reveals it to belong to edge number 4981, CCG Anabaena cylindrica (you can also confirm the edge number by looking in the file summer.edge_data.csv).

If you know your taxonomy you’re raising your eyebrows in suspicion at this point; Anabaena is a genus of freshwater cyanobacteria and is highly unlikely to be found in any abundance in Antarctic marine waters.  Most likely these are chloroplasts that snuck through QC, but just in case we’re missing an opportunity at a Nature paper it’s good to check.  The script utilities/make_edge_fasta.py can help us do this by extracting all the reads that match to that edge in the query fasta file.  It does this by using the information in the csv file created by Guppy.  Note that the fasta file that we feed it is summer.sub.clean.fasta, not summer.fasta, as the read ids reported in the csv file may no longer match the original file if they contained illegal punctuation.  Also note that we specify a range of edges for the script to return, and that the range returned includes the starting number but not the end number.

cp ~/paprica/utilities/make_edge_fasta.py make_edge_fasta.py
python make_edge_fasta.py -csv summer.sub.combined_16S.bacteria.tax.clean.align.csv -fasta summer.sub.clean.fasta -start 4981 -stop 4982

This produces the file summer.sub.clean_4981_4982.fasta containing all the reads in summer.sub.clean.fasta that were assigned to edge 4981.  To confirm (or refute) your suspicions about the taxonomy Blast a handful of the sequences on the NCBI site against the NCBI 16S database.

Looking back at our tree, other abundant CCG include Rhodospirillum centenum and Pelagibacter ubique.  P. ubique is no surprise; this is a ubiquitous marine bacterium.  R. centenum, a purple non-sulfur bacterium, seems a bit suspicious.  To take a closer look we use the same trick as we did for the cyanobacteria, but for edge 5216:

python make_edge_fasta.py -csv summer.sub.combined_16S.bacteria.tax.clean.align.csv -fasta summer.sub.clean.fasta -start 5216 -stop 5217

In this case, if you do enough work with these reads (keep in mind that the edge numbers are specific to each version of the database, so they may be meaningless to your tutorial analysis), you’ll figure out that they belong to the genus Sulfitobacter which, ecologically, is sensible.  These reads have placed with Rhodospirillum because there are no completed Sulfitobacter genomes.  If Sulfitobacter is important to your analysis – it’s certainly important to ours – you can fix this by making use of the database customization feature to add draft genomes to the database.  This requires that you build the database from scratch, which I’ll cover in a separate tutorial.  Until that happens refer to the manual or get in touch and I’ll get you pointed in the right direction.

The other option is to use the -omit and -override flags with paprica-tally_pathways.py to respectively omit a range of edges from the analysis, or override a known bad placement.  For example, if we knew (from inspection of the tree or ref_genome_database/genome_data.final.csv) that the cyanobacteria spanned the edge range 674 to 818 and wanted to additionally replace edge 5804 with 93, and edge 4619 with 4571 (selected arbitrarily):

## This is the modified line in paprica-run.sh, reflecting name change
## for subsampling, an override for two edges, and the omission of a range
## of edges.  Note the quotes for the override flag! 
paprica-tally_pathways.py -ref_dir ref_genome_database -i $query.sub.combined_16S.$domain.tax.clean.align.csv -o $query.$domain -cutoff 0.5 -domain $domain -override '5804|93,4619|4571' -omit '674:818'

These omissions and replacements will be reflected in all of the paprica output files.  Assuming that you’ve re-aggregated the output with combine_edge_results.py, what you have now is a reasonable estimate of the phylogenetic and metabolic structure of these two samples.  What we want to do now is take a closer look so that we can start to say something about their ecology.  There are lots of ways to do this, including bumbling through it in Excel, but I’d recommend using R.  R makes nice graphics and has well developed statistical functions for community ecology (and it’s free!).  The rest of this tutorial assumes that you have an R session going and are in the my_analysis directory.  To better replicate a real analysis I’ve structured the commands as though there were many samples to analyze (here we have only two).

First, read in the aggregated analysis files:

edge.data <- read.csv('my_analysis.edge_data.csv', row.names = 1)
edge.tally <- read.csv('my_analysis.edge_tally.csv', row.names = 1)
ec.tally <- read.csv('my_analysis.ec_tally.csv', row.names = 1)
path.tally <- read.csv('my_analysis.path_tally.csv', row.names = 1)

From here it’s choose your own adventure, with the way forward dependent on the questions you’re trying to ask.  I’ll illustrate a few things you can do with the data (most of which will seem a bit silly with two samples).

Question 1 – What metabolic pathways are differentially present between samples?

There are several ways to go about this.  Let’s start by building a simple heatmap of metabolic pathways.  If you were interested in looking at enzymes instead of pathways you could go through the same exact procedure with the ec.tally dataframe.

## remove metabolic pathways that don't appear in either sample
path.tally <- path.tally[which(rowSums(path.tally) > 0),]

## make a nice color gradient for a heatmap
## black is least abundant and white is most abundant
heat.colors <- colorRampPalette(c('black', 'blue', 'green', 'yellow', 'red', 'white'))(100)

## make the heatmap itself
heatmap(as.matrix(path.tally),
        labRow = NA,
        labCol = NA,
        scale = 'none',
        col = heat.colors,
        Colv = NA)

Of course with more samples we would want to label the columns, and perhaps include the clustering feature so that we can see which samples are more similar.  Here the winter sample is on the right and the summer sample is on the left.  A couple of things jump out immediately.  First, there are more pathways overall in the winter sample.  We’ll get to that shortly.  Second, there are many pathways that are not evenly distributed (or at least don’t appear to be) between these samples.  There are probably interesting ecological reasons for this, so let’s explore in greater depth.  We can start by generating a more informative heatmap.

## find pathways where abundance in winter - abundance in summer is greater than 40 % of the max winter value
## nothing magic about this formula, just a guess
target <- path.tally[which(path.tally[,2] - path.tally[,1] > 0.4 * max(path.tally[,2])),]

## this heatmap makes space for pathway names
heatmap(as.matrix(target),
        labCol = NA,
        scale = 'none',
        col = heat.colors,
        Colv = NA,
        margins = c(5,20))

Okay, that looks pretty sensible.  Lots of stories we could try to track down here… my eye is drawn to two biosynthesis pathways near the bottom that are much higher in abundance in the winter sample.  What’s up with that?  To see who is responsible for these pathways let’s read in the winter.bacteria.pathways.csv file.  This is a matrix of pathway abundance by edge for the winter sample.  Then we can slice it apart to get the edges of interest.

## read in the pathway abundance matrix
winter.edge.paths <- read.csv('winter.bacteria.pathways.csv', row.names = 1)

## which edges have the target pathway at an abundance > 0?
## yes, this command looks super convoluted.  Welcome to R.
edge.1 <- winter.edge.paths[,winter.edge.paths[which(row.names(winter.edge.paths) == 'glycine biosynthesis IV'),] > 0]

Okay, that made a new dataframe of all the pathways for all the edges that have the glycine biosynthesis IV pathway.  Scrolling through that it’s easy to identify edges 242 and 739 (in this version of the database, may differ for you) as primarily responsible for the occurrence of this pathway.

That’s nice, but who are 242 and 739?  There are many ways to get at this; the easiest is to read in the winter.bacteria.edge_data.csv file.

## read in edge data for winter
winter.edge.data <- read.csv('winter.bacteria.edge_data.csv', row.names = 1)

If you scroll through that dataframe you’ll quickly pick out edge 242 as Candidatus Pelagibacter ubique HTCC1062 and 739 as Candidatus Thioglobus singularis PS1.  At this point you’ve picked out a pathway of interest (or rather a collection of pathways of interest) based on distribution between the two samples, and linked the occurrence of the pathway back to two distinct genomes.  Pretty cool.  The rest is up to you.

Some final notes… looking at winter.edge.data we see that many of the edges don’t have a taxon name.  Those edges are CEGs not CCGs, so you’ll have to do a little sleuthing to develop a consensus taxonomy.  In the utilities directory there is a script paprica_ref_tree_quick_parse.py that will help you with this.  Feed the script an edge number associated with a CEG and it will spit back all the taxa descendant from that CEG.  You’ll have to figure out what that means.

Here we used a simple heatmap to visualize differences because we only have two samples.  In a real analysis, and even with only two samples, we would apply some kind of statistical method to identify pathways (or enzymes, or edges) that are meaningfully enriched in one sample (or set of samples).  The Wilcoxon rank-sum test works great for this if you have several samples from two groups, and there are Bayesian statistical techniques that work even if you only have two samples.
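In practice you would reach for R’s wilcox.test (or scipy.stats.mannwhitneyu in Python), but the statistic itself is easy to sketch.  The toy function below counts, for two vectors of abundances, how many (x, y) pairs favor x, with ties counting half:

```python
def mann_whitney_u(x, y):
    """U statistic via the counting-pairs definition (toy implementation)."""
    return sum(1.0 if xi > yj else (0.5 if xi == yj else 0.0)
               for xi in x for yj in y)

# Hypothetical pathway abundances in two groups of samples.
print(mann_whitney_u([9, 8, 7], [1, 2, 3]))  # -> 9.0, complete separation
```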

Question 2 – Why are there more pathways inferred for the winter than for the summer sample?

This difference could reflect either an artifact or an interesting ecological point.  Let’s try to tease apart which it is.  Since we subsampled our reads to the same depth it isn’t simply a function of more reads in one sample.  One possible explanation, however, is that the winter sample has a greater proportion of CCG while the summer sample has a greater proportion of CEG.  Since CCG (being completed genomes) generally have more pathways, this could result in more pathways inferred for the winter sample.

One way to evaluate this is to compare the confidence scores for the samples.  A lower confidence score for the summer sample could indicate more internal placements.  If you open the file winter.bacteria.sample_data.txt you see this:

name    winter.bacteria
sample_confidence    0.531163774452
npathways    741
ppathways    1007
nreads    2581
database_created_at    2016-03-03T00:59:34.792240

Name is of course the sample name, then we see the sample confidence, the number of unique pathways found (741) and the number of pathways possible (that is, the number of unique pathways in the database; 1007), the number of reads analyzed (2581), and the date the database was created.  You can find the creation date included with each database (in combined_16S.bacteria.tax.database_info.txt or combined_16S.archaea.tax.database_info.txt), this helps you remember which version of the database you used for the analysis.  Here’s the same file for the summer sample:

name    summer.bacteria
sample_confidence    0.494386889696
npathways    574
ppathways    1007
nreads    2581
database_created_at    2016-03-03T00:59:34.792240
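Since these files are just whitespace-delimited key/value pairs, they’re easy to pull into a script when you have many samples to compare.  A minimal parser sketch (not part of paprica):

```python
def parse_sample_data(text):
    """Parse the contents of a paprica sample_data.txt into a dict of strings."""
    data = {}
    for line in text.strip().splitlines():
        key, value = line.split(None, 1)
        data[key] = value
    return data

example = "name\tsummer.bacteria\nsample_confidence\t0.494386889696\nnpathways\t574"
summer = parse_sample_data(example)
print(summer['npathways'])  # prints 574; values are kept as strings
```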

Interesting, the summer sample confidence score is lower (note: don’t interpret the score as % confidence; it is a unitless number between 0 and 1, only meant to compare between samples in the same analysis).  Let’s look at this a little deeper.  The edge_data.csv files (remember that we already looked at the one for the winter sample) provide lots of helpful information.  For example, perhaps the genomes are larger in the winter sample?  This could account for the difference.  To test this you might try something like this:

## read in edge_data for summer
summer.edge.data <- read.csv('summer.bacteria.edge_data.csv', row.names = 1)

mean(summer.edge.data$npaths_actual)
mean(winter.edge.data$npaths_actual)

This gives you the mean number of pathways for each edge found in the sample, but that isn’t quite right because it ignores that some edges are very abundant and some are very rare.  Taking that into account is a little more complicated:

## calculate actual mean number of pathways in each sample
## note that we multiply by nedge_corrected not nedge so that
## abundance is corrected for multiple 16S rRNA gene copies
mean.paths.summer <- mean(rep(summer.edge.data$npaths_actual, times = summer.edge.data$nedge_corrected))
mean.paths.winter <- mean(rep(winter.edge.data$npaths_actual, times = winter.edge.data$nedge_corrected))

In my analysis the mean number of paths in summer is 150 vs. 137 in winter.  This is what we would expect – winter should host smaller-genomed bacteria optimized to low-energy conditions – but it works against any explanation of why there are more pathways in the winter.  Let’s look at some other sample parameters.

Two of the key parameters in summer.edge.data and winter.edge.data are nedge and nedge_corrected.  The former is raw edge abundance, while the latter has been normalized for multiple 16S rRNA gene copies.  If, for example, a taxon has 6 copies of the 16S rRNA gene, a typical 16S rRNA gene amplification will produce 6x as many amplicons for that taxon as it should.  This will massively bias community structure.  The parameter nedge_corrected is simply nedge divided by n16S (the number of 16S rRNA gene copies for the CCG, or estimated for the CEG).  Let’s try our same mean value trick for this parameter:

## calculate actual mean number of 16S rRNA gene copies in each sample
mean.16S.summer <- mean(rep(summer.edge.data$n16S, times = summer.edge.data$nedge_corrected))
mean.16S.winter <- mean(rep(winter.edge.data$n16S, times = winter.edge.data$nedge_corrected))

Ah ha!  I’ve got a mean n16S value for the winter of 1.4 vs. 2.5 for the summer.  Since our subsampling didn’t take this into account (no way it could, since we didn’t know the composition of the sample) an interesting ecological point produced unequal coverage.  Frustrating, but this is useful information.  We would expect more 16S rRNA gene copies in a more copiotrophic summer bacterial population, and here we can demonstrate it!  If equal coverage is really important to you some fancier work in R can sort this out, or you could go back and re-run the summer sample after making the adjustment manually.  Subsampling the summer sample to ~ 1,780 and the winter to only 4,594 should even things up.
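As a footnote, the copy-number correction behind nedge_corrected (described above) is just a division, so it’s easy to reproduce for spot checks:

```python
# nedge_corrected = nedge / n16S, where n16S is the 16S rRNA gene copy
# number known for a CCG or predicted for a CEG.
def correct_abundance(nedge, n16S):
    return nedge / float(n16S)

# A 6-copy taxon that yields 60 amplicons represents ~10 organisms.
print(correct_abundance(60, 6))  # -> 10.0
```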

To be continued…


5 Responses to Tutorial: Analysis with paprica

  1. Jeff says:

    I’ve made some long overdue updates to this tutorial to get it inline with the latest paprica features and command structure. If you’re working through it and you run into any problems please let me know!

  2. Roman says:

    Thanks for paprica, it is a great piece of software I was looking for.

    Btw. I believe that the command for aggregating results is not
    python combine_edge_results.py my_analysis
    but rather
    python combine_edge_results.py my_analysis -edge_in [suffix pattern for edges] -path_in [suffix pattern for paths] -ec_in [suffix pattern for ec numbers] -o [prefix for output]
    e.g.
    python combine_edge_results.py my_analysis -edge_in bacteria.edge_data.csv -path_in bacteria.sum_pathways.csv -ec_in bacteria.sum_ec.csv -o my_analysis

  3. Zhijian Jiang says:

    when I test the installation using command in virtualbox “./paprica-run.sh test.bacteria bacteria”,
    error come out below:
    ./paprica-run.sh: line 18: paprica-place_it.py: command not found
    ./paprica-run.sh: line 23: paprica-tally_pathways.py: command not found

    Would you tell me how to fix this problem, many thanks!

    • Jeff says:

      Hmmm… I can’t reproduce this problem in the Virtualbox appliance. Can you confirm that you are using that VB appliance exactly as it came, i.e. that you didn’t attempt to carry out the installation steps within the VB appliance? Assuming the indicated Python scripts are located in the paprica directory this problem is usually caused by an issue with the shell script startup files.
