I’m happy to report that a paper I wrote during my postdoc at the Lamont-Doherty Earth Observatory was published online today in the ISME Journal. The paper, Bacterial community segmentation facilitates the prediction of ecosystem function along the coast of the western Antarctic Peninsula, uses a novel technique to “segment” the microbial community present in many different samples into a few groups (“modes”) that have specific functional, ecological, and genomic attributes. The inspiration for this came when I stumbled across this blog entry on an approach used in marketing analytics. Imagine that a retailer has a large pool of customers that it would like to pester with ads tailored to purchasing habits. It’s too cumbersome to develop an individualized ad based on each customer’s habits, and it isn’t clear what combination of purchasing-habit parameters accurately describes meaningful customer groups. Machine learning techniques, in this case emergent self-organizing maps (ESOMs), can be used to sort the customers in a way that optimizes their similarity and limits the risk of overtraining the model (that is, including parameters that don’t improve it).
In a 2D representation of an ESOM, the customers most like one another will be organized in geographically coherent regions of the map. Hierarchical or k-means clustering can be superimposed on the map to clarify the boundaries between these regions, which in this case might represent customers that will respond similarly to a targeted ad. But what’s really cool about this whole approach is that, unlike with NMDS or PCA or other multivariate techniques based on ordination, new customers can be efficiently classified into the existing groups. There’s no need to rebuild the model unless a new type of customer comes along, and it is easy to identify when this occurs.
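To make the idea concrete, here is a minimal sketch in plain Python. It is not the code from the paper or from the marketing post: it trains a tiny ordinary self-organizing map (a real ESOM uses a much larger grid), runs a bare-bones k-means over the map units, and then places new samples on the existing map without retraining. The toy data, the function names, and the rule for spotting a new type of sample (it sits farther from its best-matching unit than any training sample did) are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate of the map unit closest to sample x."""
    return min(weights, key=lambda unit: math.dist(weights[unit], x))


def train_som(samples, rows=3, cols=3, epochs=200, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    # Start each unit as a copy of a randomly chosen training sample.
    weights = {(r, c): list(rng.choice(samples))
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                  # learning rate shrinks over time
        sigma = max(sigma0 * decay, 0.5)  # so does the neighbourhood radius
        for x in rng.sample(samples, len(samples)):  # one shuffled pass
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma ** 2))
                for i, xi in enumerate(x):
                    w[i] += lr * h * (xi - w[i])  # pull unit toward sample
    return weights


def kmeans_units(weights, k=2, iters=20, seed=0):
    """Cluster the map units' weight vectors; returns {unit: mode index}."""
    rng = random.Random(seed)
    units = sorted(weights)
    centroids = [list(weights[u]) for u in rng.sample(units, k)]
    modes = {}
    for _ in range(iters):
        modes = {u: min(range(k),
                        key=lambda j: math.dist(weights[u], centroids[j]))
                 for u in units}
        for j in range(k):
            members = [weights[u] for u in units if modes[u] == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return modes


# Hypothetical toy data: six samples described by four features each.
train = [
    [0.70, 0.20, 0.10, 0.00], [0.72, 0.18, 0.10, 0.00], [0.68, 0.22, 0.10, 0.00],
    [0.20, 0.70, 0.10, 0.00], [0.18, 0.72, 0.10, 0.00], [0.22, 0.68, 0.10, 0.00],
]
som = train_som(train)
unit_modes = kmeans_units(som, k=2)


def classify(x):
    """Map a sample to its best-matching unit; return (mode, distance)."""
    bmu = best_matching_unit(som, x)
    return unit_modes[bmu], math.dist(som[bmu], x)


# Assumed novelty rule: farther from the map than any training sample was.
max_train_qe = max(classify(x)[1] for x in train)

familiar = [0.69, 0.21, 0.10, 0.00]  # resembles the first group
novel = [0.00, 0.00, 0.10, 0.90]     # dominated by a feature never seen before
for name, x in (("familiar", familiar), ("novel", novel)):
    mode, qe = classify(x)
    print(name, "-> mode", mode, "| new type?", qe > max_train_qe)
```

The point of the design is that `classify` only looks up the nearest unit on a map that already exists, so adding a sample costs one pass over the units rather than a rebuild of the whole model.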
Back to microbial ecology. Imagine that you have a lot of samples (in our case a five-year time series), and that you’ve described community structure for these samples with 16S rRNA gene amplicon sequencing. For each sample you have a table of OTUs, or in our case closest completed genomes and closest estimated genomes (CCGs and CEGs) determined with paprica. You know that variations in community structure have a big impact on an ecosystem function (e.g. respiration, or nitrogen fixation), but how do you test the correlation? There are statistical methods in ecology that get at this, but they are often difficult to interpret. What if community structure could be represented as a simple value suitable for regression models?
Enter microbial community segmentation. Following the customer segmentation approach described above, the samples can be segmented into modes based on community structure with an emergent self-organizing map (ESOM) and k-means clustering. Here’s what this looks like in practice:
This segmentation reduces the data for each sample from many dimensions (the number of CCGs and CEGs present in each sample) to one. This remaining dimension is a categorical variable with real ecological meaning that can be used in linear models. For example, each mode has certain genomic characteristics:
In panel a above we see that samples belonging to modes 5 and 7 (dominated by the CEG Rhodobacteraceae and CCG Dokdonia MED134, see Fig. 2 above) have the greatest average number of 16S rRNA gene copies. Because this is a characteristic of fast-growing, copiotrophic bacteria, we might also associate these modes with high levels of bacterial production.
Because the modes are categorical variables we can insert them right into linear models to predict ecosystem functions, such as bacterial production. Combined with bacterial abundance and a measure of high vs. low nucleic acid bacteria, mode accounted for 76% of the variance in bacterial production for our samples. That’s a strong correlation for environmental data. What this means in practice is that if you know the mode, and you have some flow cytometry data, you can make a pretty good estimate of carbon assimilation by the bacterial community.
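As a rough illustration of what “inserting a mode into a linear model” means, the sketch below dummy-codes a categorical mode, adds one continuous covariate, and fits ordinary least squares in plain Python. It is not the model from the paper: the numbers are hypothetical and noise-free (chosen so the fit can be checked by eye), only abundance stands in for the flow cytometry variables, and the helper names are made up for this example.

```python
def design_row(mode, abundance, levels):
    """Intercept, one 0/1 dummy per non-reference mode, then the covariate."""
    dummies = [1.0 if mode == lev else 0.0 for lev in levels[1:]]
    return [1.0] + dummies + [abundance]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical toy data: mode per sample, bacterial abundance, production.
modes = [1, 1, 2, 2, 3, 3]
abundance = [1.0, 2.0, 1.5, 3.0, 2.0, 4.0]
production = [1.5, 2.0, 3.75, 4.5, 5.0, 6.0]

levels = sorted(set(modes))  # the first level is the reference mode
X = [design_row(m, a, levels) for m, a in zip(modes, abundance)]
coef = fit_ols(X, production)

fitted = [sum(b * v for b, v in zip(coef, row)) for row in X]
mean_y = sum(production) / len(production)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(production, fitted))
ss_tot = sum((yi - mean_y) ** 2 for yi in production)
r_squared = 1.0 - ss_res / ss_tot

print("coefficients (intercept, mode 2, mode 3, abundance):", coef)
print("R squared:", r_squared)
```

Each mode coefficient is the shift in predicted production relative to the reference mode at a given abundance, which is what makes a categorical mode easy to interpret in a regression. Real environmental data are noisy, so the fit would be far less tidy than in this toy case.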
For more on what you can do with modes (such as testing for community succession) check out the article! I’ll post a tutorial on how to segment microbial community structure data into modes using R in a separate post. It’s easier than you think…