Clustering metagenomic sequence reads

Another interesting paper caught my eye last week: Nielsen et al. in Nature Biotechnology, "Identification and assembly of genomes and genetic elements in complex metagenomic samples without using a reference genome". First, a complaint: 53 authors, really? There are more offensive papers out there in this regard, but come on guys, not everyone in the research group needs to be listed. I could be totally wrong on this, but it seems unlikely that everyone on the list touched the data or contributed in a way that justifies authorship. If I’m wrong, my sincere apologies, and more power to them! If I’m not wrong, then it would be nice of the top-tier journals, where author bloat is most prevalent, to start policing this a little more aggressively.

Moving beyond that I’ve got warm feelings toward this particular (if very large) list of authors, as Nik Blom and others got me started in biological sequence analysis with a lab rotation at DTU back in 2010 (and are co-authors on one of the papers that came from my dissertation work). The Nielsen et al. paper tackles one of the more vexing and important questions regarding biological sequence analysis: Given a metagenome, how can the reads be clustered by genome? Solving this problem allows the metagenome to be more than a tool for evaluating community metabolic potential, as large genomic fragments can be used to probe deeper questions of microbial evolution, diversity, and function.

The problem addressed by the paper isn’t novel; numerous previous papers have reported genomes assembled from metagenomes. All of these used some read-clustering method to group like reads or like contigs prior to assembly. A nice example of this is given by Iverson et al., 2012. They used coverage statistics, GC content, and other metrics to bin contigs, and ultimately tetranucleotide information (4-base kmer abundance) to group like scaffolds into a (nearly) complete genome of an uncultured group II Euryarchaeota. There are now a number of papers out that use a philosophically similar approach, often relying on an emergent self-organizing map (ESOM) to perform the actual clustering. This approach seems to work reasonably well, but it is understood that reads are binned at very low resolution, with each cluster containing bits from vaguely similar genomes. There is no ability to distinguish between ecotypes or related strains.
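To make the "tetranucleotide information" part concrete, here is a minimal sketch of how a 4-base kmer frequency profile could be computed for a single contig. The function name and the handling of ambiguous bases are my own assumptions, not anything taken from Iverson et al., and real binning tools typically do more (for example, collapsing reverse complements) before handing the profiles to a clustering method such as an ESOM.

```python
from collections import Counter
from itertools import product


def tetranucleotide_freqs(seq, k=4):
    """Return the relative frequency of every possible k-mer (default k=4) in seq.

    Hypothetical helper for illustration only. Windows containing anything
    other than A, C, G, or T (e.g. N) are ignored.
    """
    seq = seq.upper()
    # All 4**k possible k-mers, so every contig gets a vector of the same length.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


if __name__ == "__main__":
    profile = tetranucleotide_freqs("ACGTACGTNNACGT")
    print(profile["ACGT"])
```

Contigs from the same genome tend to have similar profiles, which is what makes these 256-element vectors usable as input for binning.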

Nielsen et al. improved upon these methods by clustering genes based on their abundance across samples; they call the resulting clusters co-abundance gene groups (CAGs). Applying this method to 396 human gut metagenomes, they assembled 238 unique genomes. If the method survives the scrutiny of the community, it represents a big step forward in throughput. Here’s what they did in a little more detail.

The authors started by conducting a de novo assembly of the metagenomes into open reading frames (ORFs). Because ORFs are short, this is a relatively simple task. Picking an ORF at random, they then searched the dataset for ORFs with a similar abundance across all samples (having a large number of samples is essential for this approach). ORFs with a similar abundance profile were considered to be from the same genetic element, and each such group was called a canopy. Canopies of very rare or poorly distributed ORFs were rejected; the remaining canopies were considered to represent genomes and plasmids, as illustrated by the bimodal size distribution in their Fig. 2a.
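The canopy step described above can be sketched in a few lines of Python. This is only a toy version under my own assumptions: similarity is measured with a Pearson correlation, the 0.9 cutoff is an arbitrary placeholder, and the names `pearson` and `canopy_cluster` are hypothetical. It also leaves out the rejection of rare or poorly distributed canopies, and it should not be read as the authors' actual implementation.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A flat profile has no defined correlation; treat it as uncorrelated.
    return cov / (sx * sy) if sx and sy else 0.0


def canopy_cluster(profiles, min_corr=0.9, seed=0):
    """Group ORFs whose abundance profiles track a randomly picked seed ORF.

    profiles: dict mapping ORF id -> list of abundances, one per sample.
    min_corr: assumed similarity cutoff (placeholder value).
    """
    rng = random.Random(seed)
    remaining = sorted(profiles)
    canopies = []
    while remaining:
        # Pick an ORF at random to seed the next canopy.
        seed_orf = rng.choice(remaining)
        members = [
            orf for orf in remaining
            if orf == seed_orf
            or pearson(profiles[seed_orf], profiles[orf]) >= min_corr
        ]
        canopies.append(members)
        # Remove clustered ORFs so each ORF lands in exactly one canopy.
        clustered = set(members)
        remaining = [orf for orf in remaining if orf not in clustered]
    return canopies


if __name__ == "__main__":
    toy = {
        "orf_a": [1, 2, 3, 4],
        "orf_b": [2, 4, 6, 8.1],
        "orf_c": [4, 3, 2, 1],
        "orf_d": [8, 6, 4.1, 2],
    }
    print(canopy_cluster(toy))
```

With only four samples the correlations in the toy data are not very meaningful, which is a small-scale reminder of why a large number of samples matters for this approach.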

Taking these canopies as probable genetic entities, the authors then mapped the original sequence reads to each canopy and assembled each pool of reads. Overall a pretty slick method, assuming one has 396 somewhat-similar samples to analyze! I’m curious to know whether a single deeply sequenced sample could be randomly partitioned into virtual samples for a similar analysis. This would make the method available to us mere mortals without 396 metagenomes at our disposal…
