Place it!

The last couple of posts on this blog have been about 16S gene sequences, and how microbial ecologists use these sequences to identify different bacteria and determine evolutionary relationships.  The primary method for the latter is to build phylogenetic trees.  A phylogenetic tree is basically a “family tree” of a bacterial lineage.  A whole field of statistics exists that describes different ways of inferring the phylogeny behind a tree, and evaluating the confidence in a given tree.

Like any other statistical method, what you feed into a tree-building algorithm determines the confidence you can have in the final product.  For sequence data a big factor is sequence length.  A large number of long, high quality, well aligned sequences will generally produce a tree with high confidence scores that probably represents evolutionary reality.  The problem is that these days sequences are rarely long (or of particularly high quality).  For most analyses we favor next-generation sequence data, which comes as a lot of short sequence “reads”.  You can do a lot with these reads, but what you can’t do is align them to one another (very well), or produce any kind of reasonable tree from them alone.

Enter pplacer.  Pplacer is a great program produced by Erick Matsen’s group at the Fred Hutchinson Cancer Research Center (“the Hutch”).  You can read the original paper on pplacer here.  What pplacer does is map short reads to an existing, high quality reference tree (created in advance by the user).  This allows the quantitative taxonomic analysis of next-gen sequence data at a resolution surpassing what can be achieved with the standard classification tools (such as RDP).

antarctic_rhizobiales.final.fat

A “fat” tree produced from pplacer. Several hundred thousand 454 reads were classified down to the order level, and reads from the target order were placed on a reference tree constructed from near full-length sequences from the RDP. The wide red “edge” represents a large number of placements within the genus Blastobacter.

Pplacer’s a great program, and with pre-compiled executables (Mac and Linux only) it’s a cinch to get up and running.  It is, however, a rather finicky program.  Rarely has history documented an alignment parser that breaks on so many different characters.  Any punctuation in your sequence names, for example, will cause pplacer to turn up its nose at your hard-won sequence data and refuse to cooperate.

I used to maintain an elaborate series of shell scripts for prepping my query and reference sequences for pplacer, laboriously modifying them for each new data set.  I finally got tired of doing this and wrote a wrapper script in Python to clean my reference and query fasta files and do the initial alignment.  The script (PlaceIt) can be found here.  The wrapper relies on the SILVA core set for alignment, using mothur, so it is only suitable for 16S analysis.  It would be easy to modify it to work with any other method of alignment.  Note that the Matsen group has a whole suite of scripts for working with pplacer, and some of these are bound to be more suitable for many analyses.  If pplacer sounds useful for your work you should check it out!  And maybe PlaceIt can help you…
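To give a rough idea of what the cleaning step involves, here is a minimal sketch in Python.  It is not the actual PlaceIt code.  The function names (clean_name, clean_fasta) are hypothetical, and the choice to keep only letters, digits, and underscores is a conservative assumption on my part rather than a documented list of the characters pplacer rejects.

```python
import re


def clean_name(name):
    """Replace every character other than letters, digits, and '_' with '_'.

    Assumption: this whitelist is deliberately conservative; it is not
    taken from pplacer's documentation.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def clean_fasta(lines):
    """Sanitize the header lines of a FASTA file.

    Takes an iterable of FASTA lines and returns (cleaned_lines, name_map).
    Each header is reduced to a sanitized version of its first
    whitespace-delimited token.  name_map maps each new name back to the
    original header text so results can be relabeled later.
    """
    cleaned = []
    name_map = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            original = line[1:]
            tokens = original.split()
            base = clean_name(tokens[0]) if tokens else "seq"
            new = base
            n = 1
            # Two different original names can collapse to the same
            # cleaned name, so add a numeric suffix to keep names unique.
            while new in name_map:
                n += 1
                new = f"{base}_{n}"
            name_map[new] = original
            cleaned.append(">" + new)
        elif line:
            # Sequence lines pass through unchanged; blank lines are dropped.
            cleaned.append(line)
    return cleaned, name_map
```

You would run this over both the reference and query files before alignment, and hang on to name_map so that the original names can be restored on the final tree.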


One Response to Place it!

  1. Jeff says:

    My pplacer methods have gone through a lot of evolution over the years. If you are reading this, I suggest you take a look at this post with my updated methods.
