Phylogenetic placement re-re-visited

I use phylogenetic placement, namely the program pplacer, in a lot of my publications.  It is also a core part of of the paprica metabolic inference pipeline.  As a result I field a lot questions from people trying to integrate pplacer into their own workflows.  Although the Matsen group has done an excellent job with documentation for pplacer, guppy, and taxtastic, the three programs you need to work with to do phylogenetic placement from start to finish (see also EPA), there is still a steep learning curve for new users.  In the hope of bringing the angle of that curve down a notch or two, and updating my previous posts on the subject (here and here), here is a complete, start to finish example of phylogenetic placement, using 16S rRNA gene sequences corresponding to the new tree of life recently published by Hug et al.  To follow along with the tutorial start by downloading the sequences here.

You can use any number of alignment and tree building programs to create a reference tree for phylogenetic placement.  I strongly recommend using RAxML and Infernal.  After a lot of experimentation this combination seems to be produce the most correct topologies and best supported trees.  You should be aware that no 16S rRNA gene tree (or any other tree) is absolutely “correct” for domain-level let alone life-level analyses, but nothing in life is perfect.  This is double the case for this tutorial… I haven’t come across a covariance model for the 16S/18S gene across domains (if you have please post a comment!), so we will use the covariance model for the domain Bacteria.  You can find that at the Rfam database here.  For the purpose of this tutorial the cm has been renamed “”.  While you’re installing software I also recommend the excellent utility Seqmagick.

Although this tutorial was developed specifically for the 16S rRNA gene, the method is easily adapted to any DNA or protein sequence.  Simply swap out the cmalign command in Infernal for the hmmalign command in HMMER3.0, and download (or create) an appropriate HMM.  You may also need to swap out the model used by RAxML.

The workflow will follow these steps:

  1. Create an alignment of the reference sequences with Infernal
  2. Create a phylogenetic tree of the alignment
  3. Create a reference package from the alignment, tree, and stats file
  4. Proceed with the phylogenetic placement of your query reads

Create an alignment of the reference sequences

The very first thing that you need to do is clean your sequence names of any wonky punctuation.  This is something that trips up almost everyone.  You should really have no punctuation in the names beyond “_”, and no spaces!

tr "[ -%,;\(\):=\.\\\*[]\"\']" "_" < hug_tol.fasta > hug_tol.clean.fasta

Next create an alignment from the cleaned file.  I always like to degap first, although it shouldn’t matter.

## Degap
seqmagick mogrify --ungap hug_tol.clean.fasta

## Align
cmalign --dna -o hug_tol.clean.align.sto --outformat Pfam hug_tol.clean.fasta

## Convert to fasta format
seqmagick convert hug_tol.clean.align.sto hug_tol.clean.align.fasta

If any sequences are identical after alignment this will mess things up (you should never build a tree from a non-redundant alignment), so make non-redundant.

## Deduplicate sequences
seqmagick mogrify --deduplicate-sequences hug_tol.clean.align.fasta

Build the reference tree

At this point you should have a nice clean DNA alignment in the fasta format.  Now feed it to RAxML to build a tree.  Depending on the size of the alignment this can take a little bit.  I know you’ve read the entire RAxML manual, so of course you are already aware that adding additional cpus won’t help…

## Build tree without confidence scores, just to get stats for Taxtastic
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -s hug_tol.clean.align.fasta -n ref.tre -f d -p 12345

I like having a sensibly rooted tree; it’s just more pleasing to look at.  You can do this manually, or you can have RAxML try to root the tree for you.

## Root the tree
raxmlHPC-PTHREADS-AVX2 -T 2 -m GTRGAMMA -f I -t RAxML_bestTree.ref.tre -n root.ref.tre

Okay, now comes the tricky bit.  Clearly you’d like to have some support values on your reference tree, but the Taxtastic program that we will use to build the reference tree won’t be able to read the RAxML stats file if it includes confidence values.  The work around is to add the confidence scores to the already generated (rooted) tree, so that you a version of the tree with out without scores.  You will feed the scored tree to Taxtastic with the stats file from the unscored tree we already generated.

## Generate confidence scores for tree
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -f J -p 12345 -t RAxML_rootedTree.root.ref.tre -n conf.root.ref.tre -s hug_tol.clean.align.fasta

Now we can use the alignment, the rooted tree with confidence scores, and the stats file without confidence scores to create our reference package.

taxit create -l 16S_rRNA -P hug_tol.refpkg --aln-fasta hug_tol.clean.align.fasta --tree-stats RAxML_info.ref.tre --tree-file RAxML_fastTreeSH_Support.conf.root.ref.tre

Align the query reads

At this point you have the reference package and you can proceed with analyzing some query reads!  The first step is to align the query reads after combining them with the de-duplicated reference alignment*.

## Clean the names
tr "[ -%,;\(\):=\.\\\*[]\"\']" "_" < query.fasta > query.clean.fasta

## Concatenate the query reads and reference alignment
cat hug_tol.clean.align.fasta query.clean.fasta > query.hug_tol.clean.fasta

## Remove gaps
seqmagick mogrify --ungap query.hug_tol.clean.fasta

## Align
cmalign --dna -o query.hug_tol.clean.align.sto --outformat Pfam query.hug_tol.clean.fasta

## Convert to fasta
seqmagick convert query.hug_tol.clean.align.sto query.hug_tol.clean.align.fasta

Phylogenetic placement

Now we’re on the home stretch, we can execute the phylogenetic placement itself!  The flags are important here, so it’s worth checking the pplacer documentation to insure that your goals are consistent with mine (get a job, publish some papers?).  You can probably accept most of the flags for the previous commands as is.

pplacer -o query.hug_tol.clean.align.jplace -p --keep-at-most 20 -c hug_tol.refpkg query.hug_tol.clean.align.fasta

At this point you have a file named query.hug_tol.clean.align.jplace.  You will need to use guppy to convert this json-format file to information that is readable by human.  The two most useful guppy commands (in my experience) for a basic look at your data are:

## Generate an easily parsed csv file of placements, with only a single placement reported for each
## query read.
guppy to_csv --point-mass --pp -o query.hug_tol.clean.align.csv query.hug_tol.clean.align.jplace

## Generate a phyloxml tree with edges fattened according to the number of placements.
guppy fat --node-numbers --point-mass --pp -o query.hug_tol.clean.align.phyloxml query.hug_tol.clean.align.jplace

*A far more elegant solution would be to use esl-alimerge included with Infernal to merge the query and reference alignment, after aligning the query reads against  The problem with this is that esl-alimerge needs the original sto file produced by cmalign, and that file contains duplicate sequences not used to build the reference tree.  Easy enough to de-duplicate this file, but the sto format produced by Seqmagick is not compatible with esl-alimerge 🙁

1621 Total Views 1 Views Today
This entry was posted in Research. Bookmark the permalink.

2 Responses to Phylogenetic placement re-re-visited

  1. Ali says:

    Why did you align to and not silva.abe.fasta, as you did in your previous post?

    • Jeff Jeff says:

      I’ve found that Infernal and the bacteria covariance model produce a much better alignment. It’s also quite fast.

Leave a Reply

Your email address will not be published. Required fields are marked *

Comments Protected by WP-SpamShield for WordPress