Phylogenetic Placement Revisited

UW Oceanography has a student-lead bioinformatics seminar that meets every spring and fall (or at least has since the fall of 2012 – we hope to continue it).  Each quarter is a little different, this time around each seminar attendee will present on a tool or set of tools that they’ve applied in their research.  It’s a good way for us to keep up on the ever-changing slate of available tools for sequence analysis, and to delve a little deeper into the ones we currently use.  This week its my turn, and I’ve prepared a short tutorial on pplacer, a common phylogenetic placement program.  I wrote a post about phylogenetic placement a few months ago.

There is some unpublished data posted on the seminar website, so it unfortunately isn’t public.  To make the tutorial available I’ve copied it here.  What was great about putting this tutorial together was the opportunity to try papara, an alignment program developed by the same group responsible for RAxML.  Check out Scenario 2 in the tutorial for details on papara.

**************************************************************************

This is a tutorial for using pplacer with 16S rRNA genes under two three separate scenarios:

1.  Reference and reference-query alignment using mothur, tree building using RAxML
2.  Reference alignment using mothur, reference-query alignment using papara, tree building using RAxML
3.  Reference alignment using mothur, reference-query alignment using papara, tree building using FastTreeMP

There is really no difference between these 16S rRNA examples and execution with functional gene or aa data, except that of course you would need an alternate alignment strategy (clustalo, hmmalign, muscle) to generate the reference alignment.

Raw data files and scripts appear at the bottom of this page.  You will need to download the following:

mothur
RAxML (compile the PTHREADS versions)
FastTree (use the MP version)
papara – make sure you have recent versions of boost and the gcc compiler, does not compile with gcc v4.2
pplacer (and taxtastic, guppy)
Archaeopteryx

The following commands are specific to the system they were run on, so be sure to chance directory paths where necessary and the number of cpus.  If you are running this on a mac’nbook set cpus/threads to 2 for mothur and RAxML.  For pplacer to work well the query dataset should fit within the taxonomic range represented by the reference tree.  In this example a set of 16S rRNA reads were classified at the level of order, and we will try to place these reads on a reference tree constructed from full length 16S sequences spanning that order.

Scenario 1 – Reference and reference-query alignment using mothur, tree building using RAxML

First, clean all punctuation from sequence names, note that this will also degap the sequences.  This step is necessary because pplacer tends to break on punctuation.

tr " " "_" < rdp_rhizobiales_type.fasta | \
 tr -d "%" | \
 tr -d "\'" | \
 tr -d "," | \
 tr -d ";" | \
 tr -d "(" | \
 tr -d ")" | \
 tr ":" "_" | \
 tr "=" "_" | \
 tr "." "_" | \
 tr -d "\"" | \
 tr -d "\*" | \
 tr -d "[" | \
 tr -d "]" | \
 tr "-" "_" > rdp_rhizobiales_type.clean.fasta

Now use mothur to align the reference sequences against an existing high-quality (we hope) alignment.  I’ve used a concatenation of the Eukaryote, Bacteria, and Archaea Silva alignments available on the mothur website.  It’s too big to share on this site, and is overkill for this analysis, but easy to put together if you want to.  Otherwise just use the Bacteria reference file found here.

mothur "#align.seqs(candidate=rdp_rhizobiales_type.clean.fasta, \
template=/volumes/deming/databases/silva.abe.fasta, processors=8, flip=T);\
filter.seqs(trump=., vertical=T, processors=8);\
unique.seqs()"

## rename to something convenient

mv rdp_rhizobiales_type.clean.filter.unique.fasta \
rdp_rhizobiales_type.final.fasta

Now build the reference tree with RAxML.  You’ll note that the tree is not bootstrapped, pplacer can’t handle the statistics file that comes with the bootstrapped tree.  It is possible to work around this by generating multiple trees (with multiple statistics files), here we just aren’t going to worry about it.  In practice reviewers don’t seem to worry that much about it either.  Note that we need the alignment in phy format, and that RAxML won’t write over files of the same name.

perl ./fasta2phyml.pl rdp_rhizobiales_type.final.fasta
rm RAxML*
raxmlHPC-Pthreads -m GTRGAMMA \
-n rdp_rhizobiales_type.final \
-p 1234 -T 8 \
-s rdp_rhizobiales_type.final.phymlAln

pplacer wants the original fasta reference alignment, the reference tree, and the tree building statistics file all packaged together in one re-usable package.  The use of reference packages also aids in the implementation of some of pplacers fancier features, like taxonomic classification.  Ref package creation is accomplished with the taxit command from taxtastic, which comes bundled with pplacer.  Taxit won’t overwrite an existing reference package, so it’s useful to remove it beforehand.

rm -r rdp_type_rhizobiales.refpkg
taxit create -l 16S -P rdp_type_rhizobiales.refpkg \
--aln-fasta rdp_rhizobiales_type.final.fasta \
--tree-stats RAxML_info.rdp_rhizobiales_type.final \
--tree-file RAxML_bestTree.rdp_rhizobiales_type.final

Now we prep the query reads in the same manner as the reference sequences.

tr " " "_" < galand_2009_rhizobiales.fasta | \
 tr -d "%" | \
 tr -d "\'" | \
 tr -d "," | \
 tr -d ";" | \
 tr -d "(" | \
 tr -d ")" | \
 tr ":" "_" | \
 tr "=" "_" | \
 tr "." "_" | \
 tr -d "\"" | \
 tr -d "\*" | \
 tr -d "[" | \
 tr -d "]" | \
 tr "-" "_" > galand_2009_rhizobiales.clean.fasta

Combine the query and reference sequences.

cat galand_2009_rhizobiales.clean.fasta \
rdp_rhizobiales_type.clean.fasta > \
galand_rdp_rhizobiales.combined.clean.fasta

And align.

mothur "#align.seqs(candidate=galand_rdp_rhizobiales.combined.clean.fasta,\
template=/volumes/deming/databases/silva.abe.fasta, processors=8, flip=T);\
filter.seqs(vertical=T, processors=8);\
screen.seqs(minlength=50)"

At this point it is generally preferable to look at the alignment and limit the analysis to the earliest and latest start position of your reference alignment.  You can get a rough fix on this by using the mothur command summary.seqs() on the reference only alignment.  You can then use screen.seqs() to eliminate any whacky reads that didn’t align right.  Here I don’t worry about it.  In either case be sure to convert all your gap characters to ‘-‘ as pplacer doesn’t like ‘.’.

tr "." "-" \
galand_rdp_rhizobiales.combined.clean.filter.good.fasta > \
galand_rdp_rhizobiales.combined.clean.filter.good.clean.fasta

mv galand_rdp_rhizobiales.combined.clean.filter.good.clean.fasta \
galand_rdp_rhizobiales.combined.final.fasta

With a lot of luck pplacer is happy with everything you’ve done up to this point, and won’t yell at you.  Go ahead and run it.  If you start seeing working on… messages roll across your screen you can sigh in relief.

One nasty thing that pplacer does if you aren’t looking out for it is produce multiple placements, as opposed to just the top placement.  If you want to keep things easy in downstream analysis tell it to just keep the best placement.

pplacer --keep-at-most 1 \
-c rdp_type_rhizobiales.refpkg \
-p -o galand_rdp_rhizobiales.jplace \
galand_rdp_rhizobiales.combined.final.fasta

If pplacer ran successfully it’s time to make some visualizations.  The guppy program comes with pplacer and converts the .json file to phyloxml, newick, and csv formats.  It also can do some statistics and fancier things.  Here we use it to make a tree with edges fattened in proportion to the number of query sequences placed there.  After that we make a simple table of nodes and the number of placements at each.  This is great if you just want to make a histogram of your placements, which can be a lot easier than dealing with trees.

guppy fat --pp galand_rdp_rhizobiales.jplace

guppy to_csv --pp galand_rdp_rhizobiales.jplace > \
galand_rdp_rhizobiales.csv

Scenario 2 – using papara in place of the second round of mothur

Scenario 2 requires that you have some of the files from Scenario 1, so run that if you haven’t already.  If you have, then launch right in with papara, followed by pplacer…

./papara -t RAxML_bestTree.rdp_rhizobiales_type.final \
-s rdp_rhizobiales_type.final.phymlAln \
-q galand_2009_rhizobiales.clean.fasta \
-n galand_rdp

mv papara_alignment.galand_rdp \
papara_galand_rdp_rhizobiales.combined.final.phy

python phyml2fasta.py \
papara_galand_rdp_rhizobiales.combined.final.phy

pplacer --keep-at-most 1 \
-c rdp_type_rhizobiales.refpkg \
-p \
-o papara_galand_rdp_rhizobiales.jplace \
papara_galand_rdp_rhizobiales.combined.final.fasta

guppy fat --pp papara_galand_rdp_rhizobiales.jplace

guppy to_csv --pp papara_galand_rdp_rhizobiales.jplace > \
papara_galand_rdp_rhizobiales.csv

Now you can make use of the guppy kr_heat command to highlight difference in the two trees.

guppy kr_heat --pp \
papara_galand_rdp_rhizobiales.jplace \
galand_rdp_rhizobiales.jplace

Visualize all the phyloxml output with Archaeopteryx.
Scenario 3 – use FastTreeMP in place of RAxML

Both pplacer and papara are pretty happy using trees produced by FastTree, which can be a pretty elegant alternative to RAxML.  Although it is much more streamlined than RAxML, it is not quite as accurate, and should not be used on alignments with fewer than 50 sequences.

Scenario 3 requires that you have some of the files from Scenario 1, so run that if you haven’t already.  If you have, then launch right in with FastTreeMP followed by papara.  Note: I don’t think FastTreeMP was the default name of that binary, check when you install the parallel version of FastTree…

FastTreeMP -nt -gtr \
-log ft_rdp_rhizobiales_type.final.log \
rdp_rhizobiales_type.final.fasta >\
ft_rdp_rhizobiales_type.final.tre

rm -r ft_rdp_type_rhizobiales.refpkg

taxit create -l 16S -P ft_rdp_type_rhizobiales.refpkg \
--aln-fasta rdp_rhizobiales_type.final.fasta \
--tree-stats ft_rdp_rhizobiales_type.final.log \
--tree-file ft_rdp_rhizobiales_type.final.tre

papara -t ft_rdp_rhizobiales_type.final.tre \
-s rdp_rhizobiales_type.final.phymlAln \
-q galand_2009_rhizobiales.clean.fasta \
-n galand_rdp_2

mv papara_alignment.galand_rdp \
ft_papara_galand_rdp_rhizobiales.combined.final.phy

python phyml2fasta.py \
ft_papara_galand_rdp_rhizobiales.combined.final.phy

pplacer --keep-at-most 1 \
-c ft_rdp_type_rhizobiales.refpkg \
-p \
-o ft_papara_galand_rdp_rhizobiales.jplace \
ft_papara_galand_rdp_rhizobiales.combined.final.fasta

guppy fat --pp ft_papara_galand_rdp_rhizobiales.jplace

guppy to_csv --pp ft_papara_galand_rdp_rhizobiales.jplace > \
ft_papara_galand_rdp_rhizobiales.csv

Below: Heat tree showing the difference between mothur and papara alignment after Scenario 2.  Reference tree is RAxML.

 

 
Scripts and data files
11223 Total Views 8 Views Today
This entry was posted in Research. Bookmark the permalink.

2 Responses to Phylogenetic Placement Revisited

  1. Avatar photo Jeff says:

    If you’re interested in using phylogenetic placement on 16S rRNA gene collections I strongly recommend that you use our paprica-place_it.py wrapper for pplacer included as part of the paprica metabolic inference pipeline instead of the method I describe here. The wrapper reflects the best combination of alignment and tree-building that we’ve been able to come up with, and handles some annoying details (like scrubbing the sequence names of bad punctuation) for you!

    • Avatar photo Jeff says:

      Actually, you can now check out this post. The paprica wrapper is excellent if you want to dissect it, but it wont’ work on its own outside of the paprica framework.

Leave a Reply

Your email address will not be published. Required fields are marked *

WordPress Anti Spam by WP-SpamShield