To give her a way to work with the FCS files I put together a quick R script that reads in the file, sets some event limits, and produces a nice plot. With a little modification one could “gate” and count different regions. The script uses the flowCore package to read in the FCS format files, and the hist2d command in gplots to make a reasonably informative plot.

    library('flowCore')
    library('gplots')

    #### parameters ####

    f.name <- 'file.name.goes.here'  # name of the file you want to analyze, file must have extension ".FCS" (omit the extension here)
    sample.size <- 1e5               # number of events to plot, use "max" for all points
    fsc.ll <- 1                      # FSC lower limit
    ssc.ll <- 1                      # SSC lower limit
    fl1.ll <- 1                      # FL1 lower limit (ex488/em536)

    #### functions ####

    ## plotting function

    plot.events <- function(fcm, x.param, y.param){
      hist2d(log10(fcm[,x.param]),
             log10(fcm[,y.param]),
             col = c('grey', colorRampPalette(c('white', 'lightgoldenrod1', 'darkgreen'))(100)),
             nbins = 200,
             bg = 'grey',
             ylab = paste0('log10(', y.param, ')'),
             xlab = paste0('log10(', x.param, ')'))
      box()
    }

    #### read in file ####

    fcm <- read.FCS(paste0(f.name, '.FCS'))
    fcm <- as.data.frame((exprs(fcm)))

    #### analyze file and make plot ####

    ## eliminate values that are below or equal to thresholds you
    ## defined above

    fcm$SSC[fcm$SSC <= ssc.ll|fcm$FSC <= fsc.ll|fcm$FL1 <= fl1.ll] <- NA
    fcm <- na.omit(fcm)

    ## subsample the events (if the sample is larger than the number of
    ## events, all events are kept)

    fcm.sample <- fcm

    if(sample.size != 'max'){
      try({fcm.sample <- fcm[sample(length(fcm$SSC), sample.size),]}, silent = T)
    }

    ## plot events in a couple of different ways

    plot.events(fcm.sample, 'FSC', 'SSC')
    plot.events(fcm.sample, 'FSC', 'FL1')

    ## make a presentation quality figure

    png(paste0(f.name, '_FSC', '_FL1', '.png'), width = 2000, height = 2000, pointsize = 50)
    plot.events(fcm.sample, 'FSC', 'FL1')
    dev.off()
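The "gating" mentioned above can be as simple as a logical filter on the log-transformed parameters. Here is a minimal sketch of a rectangular gate. It uses simulated events in place of a real FCS file, and the gate limits and the names `gate` and `in.gate` are hypothetical; in practice you would pick the limits by eye from your own plot.

```r
## Simulated events standing in for the data frame produced by exprs();
## the distributions here are made up for illustration only.
set.seed(1)
fcm <- data.frame(FSC = 10^rnorm(1000, 3, 0.5),
                  FL1 = 10^rnorm(1000, 2, 0.5))

## Hypothetical rectangular gate, limits given in log10 units.
gate <- list(FSC = c(2.5, 3.5), FL1 = c(1.5, 2.5))

## An event is inside the gate if both parameters fall within the limits.
in.gate <- log10(fcm$FSC) >= gate$FSC[1] & log10(fcm$FSC) <= gate$FSC[2] &
  log10(fcm$FL1) >= gate$FL1[1] & log10(fcm$FL1) <= gate$FL1[2]

## Count the gated events and keep them for further analysis.
n.gated <- sum(in.gate)
fcm.gated <- fcm[in.gate,]
print(n.gated)
```

The same logical vector can be reused to count several regions, or negated to look at everything outside the gate.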

And here’s the final plot:

I’m excited to be hosting the fall meeting of NASA’s Outer Planets Assessment Group (OPAG) here at Scripps Institution of Oceanography in September. For planetary scientists at UCSD, SDSU, USD, and other institutions in the greater San Diego area, if you’ve never attended the OPAG meeting here’s your chance! The meeting will be September 6 and 7 at the Samuel H. Scripps Auditorium. More details can be found here.

What are assessment groups, and what is OPAG specifically? The assessment groups are an excellent NASA innovation to encourage dialogue within the scientific community, and between the scientific community and NASA HQ. There’s usually a little tense dialogue – in a good way – between these two ends of the scientific spectrum. I often wish NSF had a similar open-format dialogue with its user community! The form of the OPAG meeting is 15 or 30 minute presentations on a range of topics relevant to the community. These are often mission updates, planning updates for future missions, and preliminary results from the analysis of mission data. NASA has quite a few assessment groups, ranging from the Small Bodies Assessment Group (SBAG – probably the AG with the catchiest acronym) to the Mars Exploration Program Analysis Group (MEPAG). OPAG covers everything in the solar system farther from the sun than Mars. If that covers your favorite planetary body, come and check it out!

It’s traditional to have a **public** evening lecture with the OPAG meeting. For the upcoming meeting the lecture will be given at **7 pm on September 6 at the Samuel Scripps Auditorium** by my friend and colleague Britney Schmidt, a planetary scientist from Georgia Tech and an expert on Europa and on Antarctic ice sheets. Why and how one can develop that dual expertise will certainly be made clear in her talk. There is no cost to attend, but an RSVP is recommended. You can find more details and RSVP here.

First, let’s get some data. The data are various climate indices with a 13-month time lag as well as sea ice extent, open water extent, and fractional open water extent. There are a couple of years missing from this dataset, and we need to remove one more year for reasons that aren’t important here.

    env.params <- read.csv('http://www.polarmicrobes.org/extras/env_params.csv', header = T, row.names = 1)
    env.params <- env.params[row.names(env.params) != '1994',]

To keep things simple we’ll create our response variable by hand. Never mind what the response variable is for now; you can read our paper when it comes out.

    response <- c(0.012, 0.076, 0.074, 0.108, 0.113, 0.154, 0.136, 0.183, 0.210, 0.043, 0.082, 0.092, 0.310, 0.185, 0.436, 0.357, 0.472, 0.631, 0.502)

One thing to note at this point is that, because we have so few observations, we can’t really withhold any to cross-validate the glmnet regression. That’s a shame because it means that we can’t easily optimize the alpha parameter (which determines whether glmnet uses lasso, elastic net, or ridge) as was done here. Instead we’ll use alpha = 0.9. This is to make the analysis a little bit elastic-netty, but still interpretable.
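To make the role of alpha concrete: as documented for glmnet, the penalty added to the loss is lambda * [(1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|)], so alpha = 1 is the lasso, alpha = 0 is ridge, and anything in between is elastic net. The sketch below simply evaluates that penalty for a made-up coefficient vector; `enet.penalty` and `beta` are illustrative names, not part of glmnet.

```r
## Elastic net penalty as a function of the coefficients, lambda, and alpha.
enet.penalty <- function(beta, lambda, alpha){
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

## Hypothetical coefficient vector (intercept excluded, it is not penalized).
beta <- c(0.5, -0.2, 0)

lasso.penalty <- enet.penalty(beta, lambda = 1, alpha = 1)    # L1 norm only
ridge.penalty <- enet.penalty(beta, lambda = 1, alpha = 0)    # half the squared L2 norm
mixed.penalty <- enet.penalty(beta, lambda = 1, alpha = 0.9)  # mostly lasso, as used here

print(c(lasso = lasso.penalty, ridge = ridge.penalty, mixed = mixed.penalty))
```

With alpha = 0.9 the penalty is dominated by the L1 term, which is why the fit still drives many coefficients to exactly zero and stays interpretable.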

In my real analysis I was using glmnet to identify predictors for lots of response variables. It was therefore useful to write a function to generalize the several commands needed for glmnet. Here’s the function:

    library(glmnet)

    ## The function requires a matrix of possible predictors, a vector with the response variable,
    ## the GLM family used for the model (e.g. 'gaussian'), the alpha parameter, and "type.measure".
    ## See the documentation on cv.glmnet for options.  Probably you want "deviance".

    get.glmnet <- function(predictors, response, family, alpha, type.measure){
      glmnet.out <- glmnet(predictors, response, family = family, alpha = alpha)
      glmnet.cv <- cv.glmnet(predictors, response, family = family, alpha = alpha, type.measure = type.measure)

      ## Find the value of lambda with the minimum cross-validated error.
      lambda.lim <- glmnet.cv$lambda[which.min(glmnet.cv$cvm)]

      ## Now identify the coefficients that correspond to the lambda minimum.
      temp.coefficients.i <- coefficients(glmnet.out, s = lambda.lim)@i + 1 # +1 because the @i slot of the sparse matrix is 0-indexed

      ## And the parameter names...
      temp.coefficients.names <- coefficients(glmnet.out, s = lambda.lim)@Dimnames[[1]][temp.coefficients.i]
      temp.coefficients <- coefficients(glmnet.out, s = lambda.lim)@x

      ## Package for export.
      temp.coefficients <- rbind(temp.coefficients.names, temp.coefficients)
      return(temp.coefficients)
    }

Phew! Okay, let’s try to do something with this. Note that glmnet requires a matrix as input, not a dataframe.

    response.predictors <- get.glmnet(data.matrix(env.params), response, 'gaussian', 0.9, 'deviance')

```
>t(response.predictors)
temp.coefficients.names temp.coefficients
[1,] "(Intercept)" "0.496410195525042"
[2,] "ao.Apr" "0.009282064516813"
[3,] "ao.current" "-0.0214919836174853"
[4,] "pdo.Mar" "-0.0568728879266135"
[5,] "pdo.Aug" "-0.00881845994191182"
[6,] "pdo.Dec" "-0.0321738320415234"
[7,] "ow.Dec" "1.92231198892721e-06"
[8,] "ow.frac.current" "0.207945851122607"
[9,] "ice.Sep" "-2.29621552264475e-06"
```
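The quotes in the output above are a reminder that binding names and values together with rbind coerces everything to character. If you want to do arithmetic on the coefficients, a minimal sketch of converting the result to a data frame follows. To keep it self-contained it rebuilds the first three columns of the output by hand rather than calling glmnet; `coef.matrix` and `coef.df` are illustrative names.

```r
## Stand-in for the matrix returned by get.glmnet (first three coefficients
## copied from the output above).
coef.matrix <- rbind(temp.coefficients.names = c('(Intercept)', 'ao.Apr', 'ao.current'),
                     temp.coefficients = c('0.496410195525042', '0.009282064516813', '-0.0214919836174853'))

## One row per coefficient, with the values converted back to numeric.
coef.df <- data.frame(predictor = coef.matrix['temp.coefficients.names',],
                      coefficient = as.numeric(coef.matrix['temp.coefficients',]),
                      stringsAsFactors = FALSE)

print(coef.df)
```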

So glmnet has retained 8 predictors (9 coefficients if you count the intercept). Possible, but I’m a little suspicious. I’d like to see this in the more familiar GLM format so that I can wrap my head around the significance of the variables. Let’s start by building a null model of all the predictors.

    null.model <- lm(response ~ ao.Apr + ao.current + pdo.Mar + pdo.Aug + pdo.Dec + ow.Dec + ow.frac.current + ice.Sep, data = env.params)

    > summary(null.model)

    Call:
    lm(formula = response ~ ao.Apr + ao.current + pdo.Mar + pdo.Aug +
        pdo.Dec + ow.Dec + ow.frac.current + ice.Sep, data = env.params)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.14304 -0.03352 -0.01679  0.05553  0.13497

    Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
    (Intercept)      2.052e+00  1.813e+00   1.132   0.2841
    ao.Apr           2.226e-02  3.702e-02   0.601   0.5610
    ao.current      -3.643e-02  1.782e-02  -2.044   0.0681 .
    pdo.Mar         -7.200e-02  3.215e-02  -2.240   0.0490 *
    pdo.Aug         -1.244e-02  2.948e-02  -0.422   0.6821
    pdo.Dec         -6.072e-02  3.664e-02  -1.657   0.1285
    ow.Dec           3.553e-06  1.749e-06   2.032   0.0696 .
    ow.frac.current  1.187e-01  4.807e-01   0.247   0.8100
    ice.Sep         -1.023e-05  8.558e-06  -1.195   0.2596
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.09043 on 10 degrees of freedom
    Multiple R-squared:  0.8579, Adjusted R-squared:  0.7443
    F-statistic: 7.549 on 8 and 10 DF,  p-value: 0.002224

Hmmm… this looks messy. There’s only one predictor with a slope significantly different from 0. A high amount of variance in the original data is accounted for, but it smells like overfitting to me. Let’s QC this model a bit by 1) checking for collinearity among the predictors, 2) using AIC and relative likelihood to further eliminate predictors, and 3) selecting the final model with ANOVA.

    ## Check for collinearity among the predictors using variance inflation factors
    library(car)

    > vif(null.model)
             ao.Apr      ao.current         pdo.Mar         pdo.Aug         pdo.Dec          ow.Dec ow.frac.current
           1.610203        1.464678        1.958998        2.854688        2.791749        1.772611        1.744653
            ice.Sep
           1.510852
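For reference, the variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand in base R. It uses the built-in mtcars dataset rather than our data so that it runs anywhere, and `vif.by.hand` is an illustrative helper, not a function from the car package.

```r
## VIF for one predictor: regress it on the other predictors and
## convert the resulting R-squared.
vif.by.hand <- function(predictor, others, data){
  f <- reformulate(others, response = predictor)
  r2 <- summary(lm(f, data = data))$r.squared
  1 / (1 - r2)
}

## Example predictors from the built-in mtcars dataset.
predictors <- c('wt', 'hp', 'disp')

vifs <- sapply(predictors,
               function(p) vif.by.hand(p, setdiff(predictors, p), mtcars))

print(vifs)
```

A value of 1 means the predictor is uncorrelated with the others; larger values mean more of its variance is explained by them.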

Surprisingly, all the parameters have acceptable vif scores (vif < 5). Let’s proceed with relative likelihood.

    ## Define a function to evaluate relative likelihood.

    rl <- function(aicmin, aici){
      return(exp((aicmin-aici) / 2))
    }

    ## Construct multiple possible models, by adding parameters by order of abs(t value) in summary(null.model).

    model.1 <- lm(response ~ pdo.Mar, data = env.params)
    model.2 <- lm(response ~ pdo.Mar + ao.current, data = env.params)
    model.3 <- lm(response ~ pdo.Mar + ao.current + ow.Dec, data = env.params)
    model.4 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec, data = env.params)
    model.5 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep, data = env.params)
    model.6 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr, data = env.params)
    model.7 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr + pdo.Aug, data = env.params)

    ## Collect AIC scores for models.

    model.null.aic <- AIC(null.model)
    model.1.aic <- AIC(model.1)
    model.2.aic <- AIC(model.2)
    model.3.aic <- AIC(model.3)
    model.4.aic <- AIC(model.4)
    model.5.aic <- AIC(model.5)
    model.6.aic <- AIC(model.6)
    model.7.aic <- AIC(model.7)

    ## Identify the model with the lowest AIC score.

    > which.min(c(model.1.aic, model.2.aic, model.3.aic, model.4.aic, model.5.aic, model.6.aic, model.7.aic, model.null.aic))
    [1] 5

So model.5 has the lowest AIC. We need to check the relative likelihood of other models minimizing information loss. Models with values < 0.05 do not have a significant likelihood of minimizing information loss and can be discarded.

    > rl(model.5.aic, model.1.aic)
    [1] 8.501094e-05
    > rl(model.5.aic, model.2.aic)
    [1] 0.00143064
    > rl(model.5.aic, model.3.aic)
    [1] 0.002415304
    > rl(model.5.aic, model.4.aic)
    [1] 0.7536875
    > rl(model.5.aic, model.6.aic)
    [1] 0.4747277
    > rl(model.5.aic, model.7.aic)
    [1] 0.209965
    > rl(model.5.aic, model.null.aic)
    [1] 0.08183346

Excellent, we can discard quite a few possible models here: model.1, model.2, and model.3. Last we’ll use ANOVA and the chi-squared test to see if any of the remaining models are significantly different from model.4, *the remaining model with the fewest parameters*.

    > anova(model.4, model.5, test = 'Chisq')
    Analysis of Variance Table

    Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
    Model 2: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep
      Res.Df      RSS Df Sum of Sq Pr(>Chi)
    1     14 0.098626
    2     13 0.086169  1  0.012457   0.1704

    > anova(model.4, model.6, test = 'Chisq')
    Analysis of Variance Table

    Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
    Model 2: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr
      Res.Df      RSS Df Sum of Sq Pr(>Chi)
    1     14 0.098626
    2     12 0.083887  2   0.01474   0.3485

    > anova(model.4, model.7, test = 'Chisq')
    Analysis of Variance Table

    Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
    Model 2: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr + pdo.Aug
      Res.Df      RSS Df Sum of Sq Pr(>Chi)
    1     14 0.098626
    2     11 0.082276  3   0.01635   0.5347

    > anova(model.4, null.model, test = 'Chisq')
    Analysis of Variance Table

    Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
    Model 2: response ~ ao.Apr + ao.current + pdo.Mar + pdo.Aug + pdo.Dec + ow.Dec + ow.frac.current + ice.Sep
      Res.Df      RSS Df Sum of Sq Pr(>Chi)
    1     14 0.098626
    2     10 0.081778  4  0.016849   0.7247

Okay, there’s a whole lot going on there, but the important thing is that none of the models are significantly different from the model with the fewest parameters. There are probably some small gains in the performance of those models, but at an increased risk of overfitting. So model.4 is the winner, and we deem pdo.Mar, ao.current, ow.Dec, and pdo.Dec to be the best predictors of the response variable. For those of you who are curious, that’s the March index for the Pacific Decadal Oscillation, the index for the Antarctic Oscillation when the cruise was taking place, within-pack ice open water extent in December (the month prior to the cruise), and the Pacific Decadal Oscillation index in December. Let’s take a look at the model:

    > summary(model.4)

    Call:
    lm(formula = response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec,
        data = env.params)

    Residuals:
          Min        1Q    Median        3Q       Max
    -0.177243 -0.037756 -0.003996  0.059606  0.114155

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -3.586e-02  5.893e-02  -0.609 0.552563
    pdo.Mar     -9.526e-02  2.208e-02  -4.315 0.000713 ***
    ao.current  -3.127e-02  1.590e-02  -1.967 0.069308 .
    ow.Dec       4.516e-06  1.452e-06   3.111 0.007663 **
    pdo.Dec     -8.264e-02  2.172e-02  -3.804 0.001936 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.08393 on 14 degrees of freedom
    Multiple R-squared:  0.8287, Adjusted R-squared:  0.7797
    F-statistic: 16.93 on 4 and 14 DF,  p-value: 2.947e-05

You’ll recall that the null model accounted for 74 % of the variance in the response variable (going by the adjusted R-squared). Our final model accounts for 78 %, and we’ve dropped several parameters. Not bad! Note, however, that we never could have gotten to this point without the holistic search provided by glmnet.
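If you have many candidate models, the AIC and relative likelihood comparison used above is easier to run in one pass over a list of models. Here is a minimal sketch, using the built-in mtcars dataset and arbitrary example models so that it runs without our data; the 0.05 cutoff is the same one used above.

```r
## Relative likelihood of model i with respect to the minimum-AIC model.
rl <- function(aicmin, aici){
  exp((aicmin - aici) / 2)
}

## A named list of nested candidate models (examples only).
models <- list(m1 = lm(mpg ~ wt, data = mtcars),
               m2 = lm(mpg ~ wt + hp, data = mtcars),
               m3 = lm(mpg ~ wt + hp + qsec, data = mtcars))

## AIC for every model, then relative likelihood against the best one.
aic <- sapply(models, AIC)
rel.lik <- rl(min(aic), aic)

## Models that cannot be discarded at the 0.05 level.
keep <- names(rel.lik)[rel.lik >= 0.05]

print(round(rel.lik, 4))
print(keep)
```

The best model always has a relative likelihood of 1, so it is always in `keep`; the remaining candidates can then be compared with anova() as above.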

I can still hear some grumbling about overfitting, however, so let’s try to address that by bootstrapping. We are crazy data-limited with only 19 observations, but let’s iteratively hold three observations back at random, build a model with the remaining 16 (using our identified parameters), and see how well each model predicts the three that were held back. I created a function to do this so that I could repeat this exercise for lots of different response variables.

    ## The function requires as input the response vector, a vector of the predictors,
    ## the data frame of predictors, and the number of iterations you want to run.

    bootstrap.model <- function(response.vector, predictors, data, n){

      predictor.df <- as.data.frame(data[,predictors])
      colnames(predictor.df) <- predictors
      model.predictions <- matrix(ncol = 3, nrow = 0)
      colnames(model.predictions) <- c('year', 'prediction', 'real')

      ## How many observations you need to withhold is determined by how many
      ## iterations you want to run.  For example, for 1000 iterations of
      ## unique random subsets of 19 observations, you need to withhold 3.

      x <- 0
      boot <- 0

      for(y in (length(response.vector) - 1):1){
        while(x < n){
          boot <- boot + 1
          x <- y ** boot
          print(x)
        }
      }

      ## Now build n new models.

      for(i in 1:n){
        print(i)
        predict.i <- sample.int(length(response.vector), boot)
        train.i <- which(!1:length(response.vector) %in% predict.i)

        temp.model <- lm(response.vector ~ ., data = predictor.df, subset = train.i)

        new.data <- as.data.frame(predictor.df[predict.i,])
        colnames(new.data) <- predictors

        temp.predict <- predict(temp.model, newdata = new.data, type = 'response')
        temp.actual <- response.vector[predict.i]

        temp.out <- matrix(ncol = 3, nrow = boot)
        try({temp.out[,1] <- names(response.vector)[predict.i]}, silent = T)
        temp.out[,2] <- temp.predict
        temp.out[,3] <- temp.actual

        model.predictions <- rbind(model.predictions, temp.out)
      }

      ## Mean and standard deviation of the predictions for each observed value.

      mean.predicted <- as.data.frame(tapply(as.numeric(model.predictions[,2]), as.character(model.predictions[,3]), mean))
      sd.predicted <- as.data.frame(tapply(as.numeric(model.predictions[,2]), as.character(model.predictions[,3]), sd))

      ## Make an awesome plot.

      plot(mean.predicted[,1] ~ as.numeric(row.names(mean.predicted)),
           ylab = 'Predicted response',
           xlab = 'Observed response',
           pch = 19)

      for(r in row.names(mean.predicted)){
        lines(c(as.numeric(r), as.numeric(r)), c(mean.predicted[r,1], mean.predicted[r,1] + sd.predicted[r,1]))
        lines(c(as.numeric(r), as.numeric(r)), c(mean.predicted[r,1], mean.predicted[r,1] - sd.predicted[r,1]))
      }

      abline(0, 1, lty = 2)

      mean.lm <- lm(mean.predicted[,1] ~ as.numeric(row.names(mean.predicted)))
      abline(mean.lm)
      print(summary(mean.lm))

      ## Capture stuff that might be useful later in a list.

      list.out <- list()
      list.out$model.predictions <- model.predictions
      list.out$mean.lm <- mean.lm
      list.out$mean.predicted <- mean.predicted
      list.out$sd.predicted <- sd.predicted

      return(list.out)
    }

    ## And execute the function...

    demo.boostrap <- bootstrap.model(response, c('pdo.Mar', 'ao.current', 'ow.Dec', 'pdo.Dec'), env.params, 1000)

Hey, that looks pretty good! The dotted line is 1:1, while the solid line gives the slope of the regression between predicted and observed values. The lines through each point give the standard deviation of the predictions for that point across all iterations. There’s scatter here, but the model predicts low values for low observations, and high for high, so it’s a start…
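A side note on the holdout size: bootstrap.model picks it with a power heuristic. If what you want is the smallest holdout that allows a given number of distinct (unordered) holdout sets, you can count them directly with choose(). The sketch below is an alternative under that assumption, not what the function above does, and `holdout.size` is an illustrative name.

```r
## Smallest number of withheld observations k such that the number of
## distinct holdout sets, choose(n.obs, k), is at least n.iter.
## Returns NA if no holdout size gives that many sets.
holdout.size <- function(n.obs, n.iter){
  k <- 1:(n.obs - 1)
  n.subsets <- choose(n.obs, k)
  ok <- which(n.subsets >= n.iter)
  if(length(ok) == 0){
    return(NA_integer_)
  }
  k[min(ok)]
}

## Number of distinct holdout sets of size 3 and 4 from 19 observations.
print(choose(19, 3))
print(choose(19, 4))

## Smallest holdout that allows 1000 distinct sets.
print(holdout.size(19, 1000))
```

By this count there are slightly fewer than 1000 distinct sets of three among 19 observations, so a few of the 1000 iterations above necessarily reuse a holdout set. With random sampling that is harmless, but it is worth knowing when interpreting the spread of the predictions.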

You could carry out this dimension reduction with pretty much any clustering algorithm; you’re simply grouping samples with like community structure characteristics on the assumption that like communities will have similar ecosystem functions. We use the emergent self organizing map (ESOM), a neural network algorithm, because it allows new data to be classified into an existing ESOM. For example, imagine that you are collecting a continuous time series of microbial community structure data. Once you build an ESOM to segment your first few years of data, subsequent samples can be quickly classified into the existing model. Thus the taxonomic structure, physiological, and ecological characteristics of the segments are stable over time. There are other benefits to using an ESOM. One is that with many samples (far more than we had in our study), the ESOM is capable of resolving patterns that many other clustering techniques cannot.

There are many ways to construct an ESOM. I haven’t tried a Python-based approach, although I’m keen to explore those methods. For the *ISME J* paper I used the Kohonen package in R, which has a nice publication that describes some applications and is otherwise reasonably well documented. To follow this tutorial you can download our abundance table here. Much of the inspiration, and some of the code for this analysis, follows the (retail) customer segmentation example given here.

For this tutorial you can download a table of the closest estimated genomes and closest completed genomes (analogous to an abundance table) here. Assuming you’ve downloaded the data into your working directory, fire up Kohonen and build the ESOM.

    ## Kohonen needs a numeric matrix
    edge.norm <- as.matrix(read.csv('community_structure.csv', row.names = 1))

    ## Load the library
    library('kohonen')

    ## Define a grid.  The bigger the better, but you want many fewer units in the grid
    ## than samples.  1:5 is a good ballpark, here we are minimal.
    som.grid <- somgrid(xdim = 5, ydim=5, topo="hexagonal")

    ## Now build the ESOM!  It is worth playing with the parameters, though in
    ## most cases you will want the circular neighborhood and toroidal map structure.
    ## Note: the n.hood and toroidal arguments follow the kohonen 2.x interface;
    ## newer releases of the package set these options in somgrid() instead.
    som.model.edges <- som(edge.norm,
                           grid = som.grid,
                           rlen = 100,
                           alpha = c(0.05,0.01),
                           keep.data = TRUE,
                           n.hood = "circular",
                           toroidal = T)

Congratulations! You’ve just constructed your first ESOM. Pretty easy. You’ve effectively clustered the samples into the 25 units that define the ESOM. You can visualize this as such:

    plot(som.model.edges, type = 'mapping', pch = 19)

There are the 25 map units, with the toroid split and flattened into 2D. Each point is a sample (row in the abundance table), positioned in the unit that best reflects its community structure. I’m not going to go into any depth on the ESOM algorithm, which is quite elegant, but the version implemented in the Kohonen package is based on Euclidean distance. How well each map unit represents the samples positioned within it is represented by the distance between the map unit and each sample. This can be visualized with:

    plot(som.model.edges, type = 'quality', pch = 19, palette.name = topo.colors)

Units with shorter distances in the plot above are better defined by the samples in those units than units with long distances. What distance is good enough depends on your data and objectives.
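To make the distance idea concrete, here is a minimal base R sketch of what "positioning a sample in a unit" amounts to: each sample goes to the map unit whose codebook vector is closest in Euclidean distance, and that distance is the quality measure. The same lookup is all that is needed to classify a new sample into an existing map. The tiny `codes` and `samples` matrices are made up for illustration, and `classify.samples` is a hypothetical helper, not part of the kohonen package.

```r
## Toy codebook: 3 map units described by 2 features each.
codes <- matrix(c(0, 0,
                  1, 1,
                  5, 5), ncol = 2, byrow = TRUE)

## Toy samples to be placed on the map.
samples <- matrix(c(0.1, -0.1,
                    0.9,  1.2,
                    4.0,  5.5), ncol = 2, byrow = TRUE)

classify.samples <- function(samples, codes){
  ## Distance from every sample (rows) to every unit (columns).
  d <- sapply(seq_len(nrow(codes)),
              function(j) sqrt(rowSums(sweep(samples, 2, codes[j,])^2)))
  d <- matrix(d, nrow = nrow(samples))

  ## Best-matching unit and the distance to it.
  data.frame(unit = apply(d, 1, which.min),
             distance = apply(d, 1, min))
}

result <- classify.samples(samples, codes)
print(result)
```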

The next piece is trickier because there’s a bit of an art to it. At this point each sample has been assigned to one of the 25 units in the map. In theory we could call each map unit a “segment” and stop here. It’s beneficial, however, to do an additional round of clustering on the map units themselves. Particularly on large maps (which clearly this is not) this will highlight major structural features in the data. Both k-means and hierarchical clustering work fairly well. Anecdotally, k-means seems to work better with smaller maps and hierarchical with larger maps, but you should evaluate for your data. Here we’ll use k-means. K-means requires that you specify the number of clusters in advance, which is always a fun chicken and egg problem. To solve it we use the within-clusters sum of squares method:

    wss.edges <- (nrow(som.model.edges$codes)-1)*sum(apply(som.model.edges$codes,2,var))

    for (i in 2:15) {
      wss.edges[i] <- sum(kmeans(som.model.edges$codes, centers=i)$withinss)
    }

    plot(wss.edges,
         pch = 19,
         ylab = 'Within-clusters sum of squares',
         xlab = 'K')

Here’s where the art comes in. Squint at the plot and try to decide where the inflection point is. I’d call it 8, but you should experiment with whatever number you pick to see if it makes sense downstream.
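If you would rather try the hierarchical alternative mentioned earlier, the same map units can be clustered with hclust() and cut into k groups with cutree(). This is a minimal sketch on a simulated codebook matrix (25 units, 2 features) so that it runs without the tutorial data; with a real map you would pass the codebook vectors instead. The choice of Ward linkage is an assumption for illustration, not something we evaluated.

```r
## Simulated codebook vectors for 25 map units in three loose groups.
set.seed(7)
codes <- rbind(matrix(rnorm(20, 0, 0.3), ncol = 2),
               matrix(rnorm(20, 3, 0.3), ncol = 2),
               matrix(rnorm(10, 6, 0.3), ncol = 2))

## Hierarchical clustering on the Euclidean distances between units.
unit.tree <- hclust(dist(codes), method = 'ward.D2')

## Cut the tree into k clusters; the result has one cluster id per unit.
k <- 3
unit.cluster <- cutree(unit.tree, k = k)

print(table(unit.cluster))
```

The resulting vector has one cluster id per map unit, just like the `cluster` element returned by kmeans(), so it can be used the same way downstream.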

We can make another plot of the map showing which map units belong to which clusters:

    k <- 8
    som.cluster.edges <- kmeans(som.model.edges$codes, centers = k)

    plot(som.model.edges,
         main = '',
         type = "property",
         property = som.cluster.edges$cluster,
         palette.name = topo.colors)

    add.cluster.boundaries(som.model.edges, som.cluster.edges$cluster)

Remember that the real shape of this map is a toroid and not a square. The colors represent the final “community segmentation”; the samples belong to map units, and the units belong to clusters. In our paper we termed these clusters “modes” to highlight the fact that there are real ecological properties associated with them, and that (unlike clusters) they support classification. To get the mode of each sample we need to index the sample-unit assignments against the unit-cluster assignments. It’s a little weird until you get your head wrapped around it:

    > som.cluster.edges$cluster[som.model.edges$unit.classif]
     [1] 5 7 7 5 2 7 5 3 7 5 2 6 1 1 1 7 5 4 7 7 5 7 7 7 7 7 7 1 4 4 4 4 7 7 7 6 6 6 6 1 1 1 7 5 5 5 1 1 1 5 5 7 7 4 8 7 7 4 7 8
    [61] 7 7 7 7 6 5 6 7 7 7 6 4 6 5 4 4 6 2 1 1 1 1 1 4 1 4 4 4
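If the double indexing is hard to picture, a toy version may help. Below, five samples sit on a three-unit map; the vectors are made up and simply stand in for the sample-unit and unit-cluster assignments used above.

```r
## Map unit assigned to each of 5 samples (stand-in for unit.classif).
unit.classif <- c(3, 1, 1, 2, 3)

## Cluster (mode) assigned to each of the 3 map units (stand-in for the
## cluster element returned by kmeans).
unit.cluster <- c(2, 2, 1)

## Using the sample-unit vector as an index looks up, for every sample,
## the cluster of the unit it sits in.
sample.mode <- unit.cluster[unit.classif]

print(sample.mode)
```

Samples 1 and 5 sit in unit 3, which belongs to cluster 1, so they get mode 1; the other three samples sit in units 1 and 2, which both belong to cluster 2.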

A really important thing to appreciate about these modes is that they are not ordered or continuous. Mode 4 doesn’t necessarily have more in common with mode 5 say, than with mode 1. For this reason it is important to treat the modes as factors in any downstream analysis (e.g. in linear modeling). For our analysis I had a dataframe with bacterial production, chlorophyll concentration, and bacterial abundance, and predicted genomic parameters from paprica. By saving the mode data as a new variable in the dataframe, and converting the dataframe to a zoo timeseries, it was possible to visualize the occurrence of modes, model the data, and test the pattern of modes for evidence of succession. Happy segmenting!

In a 2D representation of an ESOM, the customers most like one another will be organized in geographically coherent regions of the map. Hierarchical or k-means clustering can be superimposed on the map to clarify the boundaries between these regions, which in this case might represent customers that will respond similarly to a targeted ad. *But what’s really cool* about this whole approach is that, unlike with NMDS or PCA or other multivariate techniques based on ordination, new customers can be efficiently classified into the existing groups. There’s no need to rebuild the model unless a new type of customer comes along, and it is easy to identify when this occurs.

Back to microbial ecology. Imagine that you have a lot of samples (in our case a five year time series), and that you’ve described community structure for these samples with 16S rRNA gene amplicon sequencing. For each sample you have a table of OTUs, or in our case closest completed genomes and closest estimated genomes (CCGs and CEGs) determined with paprica. You know that variations in community structure have a big impact on an ecosystem function (e.g. respiration, or nitrogen fixation), but how to test the correlation? There are statistical methods in ecology that get at this, but they are often difficult to interpret. What if community structure could be represented as a simple value suitable for regression models?

Enter microbial community segmentation. Following the customer segmentation approach described above, the samples can be segmented into *modes* based on community structure with an Emergent Self Organizing Map and k-means clustering. Here’s what this looks like in practice:

This segmentation reduces the data for each sample from many dimensions (the number of CCGs and CEGs present in each sample) to 1. This remaining dimension is a categorical variable with real ecological meaning that can be used in linear models. For example, each mode has certain genomic characteristics:

In panel *a* above we see that samples belonging to modes 5 and 7 (dominated by the CEG *Rhodobacteraceae* and CCG *Dokdonia* MED134, see Fig. 2 above) have the greatest average number of 16S rRNA gene copies. Because this is a characteristic of fast growing, copiotrophic bacteria, we might also associate these modes with high levels of bacterial production.

Because the modes are categorical variables we can insert them right into linear models to predict ecosystem functions, such as bacterial production. Combined with bacterial abundance and a measure of high vs. low nucleic acid bacteria, mode accounted for 76 % of the variance in bacterial production for our samples. That’s a strong correlation for environmental data. What this means in practice is that if you know the mode, and you have some flow cytometry data, you can make a pretty good estimate of carbon assimilation by the bacterial community.
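As a rough illustration of what "inserting the mode into a linear model" looks like in R, here is a sketch on simulated data. None of the numbers come from our study; the variable names (`bp`, `abundance`, `mode`) and the effect sizes are invented. The key step is storing the mode as a factor so that each mode gets its own coefficient rather than being treated as a number on a scale.

```r
## Simulated data: 60 samples, 3 modes, each mode with its own baseline
## bacterial production (values are made up).
set.seed(42)
n <- 60
mode <- sample(c(1, 2, 3), n, replace = TRUE)
abundance <- rnorm(n, 10, 2)
mode.effect <- c(5, 1, 3)
bp <- mode.effect[mode] + 0.5 * abundance + rnorm(n, sd = 0.5)

bp.data <- data.frame(bp = bp, abundance = abundance, mode = factor(mode))

## Mode as a categorical predictor: one coefficient per mode beyond the first.
model.factor <- lm(bp ~ abundance + mode, data = bp.data)

## Mode (incorrectly) treated as a continuous number, for comparison.
model.numeric <- lm(bp ~ abundance + as.numeric(as.character(mode)), data = bp.data)

print(coef(model.factor))
print(c(factor = summary(model.factor)$r.squared,
        numeric = summary(model.numeric)$r.squared))
```

Because the mode labels carry no order, the numeric version forces a straight line through arbitrary labels and can never fit better than the factor version.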

For more on what you can do with modes (such as testing for community succession) check out the article! I’ll post a tutorial on how to segment microbial community structure data into modes using R in a separate post. It’s easier than you think…

- NCBI taxonomy information for each point of placement on the reference tree, including internal nodes.
- Inclusion of the domain Eukarya. This was a bit tricky and requires some further explanation.

Eukaryotic genomes are a totally different beast than their archaeal and bacterial counterparts. First and foremost they are massive. Because of this there aren’t very many completed eukaryotic genomes out there, particularly for single-celled eukaryotes. While a single investigator can now sequence, assemble, and annotate a bacterial or archaeal genome in very little time, eukaryotic genomes still require major efforts by consortia and lots of $$.

One way to get around this scale problem is to focus on eukaryotic *transcriptomes* instead of *genomes*. Because much of the eukaryotic genome is noncoding this greatly reduces sequencing volume. Since there is no such thing as a contiguous transcriptome, this approach also implies that no assembly (beyond open reading frames) will be attempted. The Moore Foundation-funded Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) was an initial effort to use this approach to address the problem of unknown eukaryotic genetic diversity. The MMETSP sequenced transcriptomes from several hundred different strains. The taxonomic breadth of the strains sequenced is pretty good, even if (predictably) the taxonomic resolution is not. Thus, as for archaea, the phylogenetic tree and metabolic inferences should be treated with caution. For eukaryotes there are the additional caveats that 1) not all genes coded in a genome will be represented in the transcriptome 2) the database contains only strains from the marine environment and 3) eukaryotic 18S trees are kind of messy. Considerable effort went into making a decent tree, but you’ve been warned.

Because the underlying data is in a different format, not all genome parameters are calculated for the eukaryotes. 18S gene copy number is not determined (and thus community and metabolic structure are not normalized), and the phi parameter, GC content, etc. are also not calculated. However, eukaryotic community structure is evaluated and metabolic structure inferred in the same way as for the domains bacteria and archaea:

    ./paprica-run.sh test.eukarya eukarya

As always you can install paprica v0.4.0 by following the instructions here, or you can use the virtual box or Amazon Web Service machine instance.

If you are new to Amazon Web Services (AWS) the basic way this works is:

- Sign up for Amazon EC2 using your normal Amazon log-in
- From the AWS console, make sure that your region is N. Virginia (community AMIs are only available in the region they were created in)
- From your EC2 dashboard, scroll down to “Create Instance” and click “Launch Instance”
- Now select the “Community AMIs”
- Search for “paprica-ec2”, then select the AMI corresponding to the latest version of paprica (0.4.0 at the time of writing).
- Choose the type of instance you would like to run the AMI on. This is the real power of AWS; you can tailor the instance to the analysis you would like to run. For testing choose the free t2.micro instance. This is sufficient to execute the test files or run a small analysis (hundreds of reads). To use paprica’s parallel features select an instance with the desired number of cores and sufficient memory.
- Click “Review and Launch”, and finish setting up the instance as appropriate.
- Log onto the instance, navigate to the paprica directory, execute the test file(s) as described in the paprica tutorial. The AMI is not updated as often as paprica, so you may wish to reclone the github repository, or download the latest stable release.

I’m very excited that our manuscript “*Microbial community dynamics in two polar extremes: The lakes of the McMurdo Dry Valleys and the West Antarctic Peninsula Marine Ecosystem*” has been published as an overview article in the journal *BioScience*. The article belongs to a special issue comparing different ecological aspects of the two NSF-funded Long Term Ecological Research (LTER) sites in Antarctica. I’m actually writing this post on my return trip from the first ever open science meeting of the International Long Term Ecological Research (ILTER) network at Kruger National Park in South Africa (an excellent place to ponder ecological questions).

This article had an odd genesis; the special issue was conceived by John Priscu, a PI with the McMurdo LTER project. I was ensnared in the project along with Trista Vick-Majors, a graduate student with John Priscu (now a postdoctoral scholar at McGill University), shortly after starting my postdoc with Hugh Ducklow, PI on the Palmer LTER project. The guidance we received was more or less “compare the McMurdo and Palmer LTERs”. How exactly we should compare perennially ice-covered lakes in a polar desert to one of the richest marine ecosystems on the planet was left up to us. Fortunately, microbial ecology lends itself to highly reductionist thinking. This isn’t always helpful, but we reasoned that on a basal level the two ecosystems must function more or less the same. Despite dramatically different physical settings, both environments host communities of phytoplankton (sometimes even similar taxonomic groups). These convert solar energy into chemical energy and CO₂ into organic carbon, thereby supporting communities of heterotrophic bacteria and grazers.

To look at the details of this we stretched the bounds of what constitutes an “overview article” and aggregated nearly two decades of primary production and bacterial production data collected by the McMurdo LTER, and over a decade of the same from the Palmer LTER. By looking at the ratio of bacterial production to primary production we assessed how much carbon the heterotrophic bacterial community takes up relative to how much the phytoplankton community produces.

Typical marine values for this ratio are 1:10. At a value of around 1:5 the carbon demands of heterotrophic bacteria are probably not met by phytoplankton production (the majority of carbon taken up by bacteria is lost through respiration and is not accounted for in the bacterial production assay). Most of the lakes hover around 1:5, with values above this fairly common. Lake Fryxell, however, an odd lake at the foot of Canada Glacier, has values that often exceed 1:1! Consistent with previous work on the lakes, such high rates of bacterial production (relative to primary production) can only be met by a large external carbon subsidy.
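To make the ratio arithmetic concrete, here is a minimal R sketch. The production values are made up purely for illustration and are not data from the paper; the function name `bp.pp.ratio` and the `exceeds.pp` flag are hypothetical shorthand for the 1:5 rule of thumb described above.

```r
## Ratio of bacterial production (BP) to primary production (PP).
## Both inputs must be in the same units.
bp.pp.ratio <- function(bp, pp){
  bp / pp
}

## Illustrative values only, NOT measurements from either LTER site.
toy <- data.frame(site = c('marine.like', 'lake.like', 'subsidized'),
                  bp = c(10, 30, 120),
                  pp = c(100, 100, 100))

toy$ratio <- bp.pp.ratio(toy$bp, toy$pp)

## Flag ratios at or above ~1:5, where bacterial carbon demand
## probably outstrips what the phytoplankton produce.
toy$exceeds.pp <- toy$ratio >= 0.2

print(toy)
```

A ratio above 1 (the third row) is the Lake Fryxell type of case, where bacterial production alone exceeds primary production.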

Where does this external carbon come from? Each summer the McMurdo Dry Valleys warm up enough that the various glaciers at the valley peripheries begin to melt. This meltwater fuels chemoautotrophic bacterial communities where the glacier meets rock (the subglacial environment), and microbial mats in various streams and melt ponds. Like microbial communities everywhere these bleed a certain amount of dissolved and particulate organic carbon (DOC and POC) into the surrounding water. Some of this carbon ends up in the lakes where it enhances bacterial production.

But external carbon subsidies aren’t the only part of the story. Nutrients, namely phosphate and nitrate, are washed into the lakes as well. During big melt years (such as the summer of 2001-2002, when a major positive SAM coupled to an El Niño caused unusually high temperatures) the lakes receive big pulses of relatively labile carbon but also inorganic nutrients and silt. This odd combination has the effect of suppressing primary production in the near term through lowered light levels (all that silt), enhancing it in the long term (all those nutrients), and giving heterotrophic bacteria some high quality external carbon to feed on during the period that primary production is suppressed. Or at least that’s how we read it.

Not a lake person? How do things work over in the Palmer LTER? One of the biggest ecological differences between Palmer and McMurdo is that the former has grazers (e.g. copepods, salps, and krill) and the latter does not, or at least not so many to speak of. Thus an argument can be made that carbon dynamics at Palmer are driven (at least partially) by top-down controls (i.e. grazers), while at McMurdo they are dependent almost exclusively on bottom-up (i.e. chemical and physical) controls.

At times the difference between bacterial production and primary production is pretty extreme at Palmer. In the summer of 2006, for example, bacterial production was only 3 % of primary production (see Fig. 4 in the publication), and the rate of primary production that summer was pretty high. The krill population was also pretty high that year, at the top of its 4-year abundance cycle (see Saba et al. 2014, Nature Communications). This is speculative, but I posit that bacterial production was low in part because a large amount of carbon was being transferred via krill to the higher trophic levels and away from bacteria. This is a complicated scenario because krill can be good for bacteria; sloppy feeding produces DOC and krill excrete large amounts of reduced nitrogen and DOC. Krill also build biomass and respire, however, and their large fecal pellets sink quickly; these could be significant losses of carbon from the photic zone.

Antarctica is changing fast and in ways that are difficult to predict. Sea ice seems to be growing in the east Antarctic as it is lost from the west Antarctic, and anomalous years buck this trend in both regions. A major motivation for this special issue was to explore how the changing environment might drive ecological change. I have to say that after spending a good portion of the (boreal) summer and fall thinking about this, some of that time from the vantage point of Palmer Station, I have no idea. All of the McMurdo Lakes react differently to anomalous years, and Palmer as a region seems to react differently to each abnormal year. I think the krill story is an important concept to keep in mind here; ecological responses are like superimposed waveforms. Picture a regularly occurring phenomenon like the El Niño-Southern Oscillation imposing a periodicity on sea ice cover, which we know has a strong influence on biology. Add a few more oscillating waves from other physical processes. Now start to add biological oscillations like the four-year krill abundance cycle. Can we deconvolute this mess to find a signal? Can we forecast it forward? Certainly not with 10 years of data at one site and 20 years at the other (and we’re so proud of these efforts!). Check back next century… if NSF funds these sites that long…
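The superimposed-waveform idea can be sketched in a few lines of R. Everything here is a toy: the signal is a made-up sum of a 4-year cycle and a weaker 10-year cycle (the 10-year period is an arbitrary assumption, not an estimate for any real process), and `dominant.period` is a hypothetical helper that simply reports the period with the most power in an FFT.

```r
## Find the dominant period in an evenly sampled series using the FFT.
## x: numeric vector, dt: sampling interval in years.
dominant.period <- function(x, dt){
  n <- length(x)
  power <- Mod(fft(x - mean(x)))^2
  k <- 1:floor(n / 2)                 # positive frequencies only
  k.max <- k[which.max(power[k + 1])]
  (n * dt) / k.max                    # period in years
}

## Hypothetical signal: a 4-year cycle plus a weaker 10-year cycle.
toy.signal <- function(t){
  sin(2 * pi * t / 4) + 0.5 * sin(2 * pi * t / 10)
}

dt <- 1 / 12                                      # monthly sampling
t.long <- seq(0, by = dt, length.out = 40 * 12)   # 40 years of data
t.short <- seq(0, by = dt, length.out = 10 * 12)  # 10 years of data

long.estimate <- dominant.period(toy.signal(t.long), dt)
short.estimate <- dominant.period(toy.signal(t.short), dt)
```

With 40 years of noise-free data the 4-year cycle is recovered. With only 10 years the FFT can only return periods of 10/k years (10, 5, 3.33, …), so a 4-year cycle cannot even be represented exactly, and that is before adding noise or a third wave.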

Many thanks to my co-authors for going the distance on this paper, particularly the lake people for many stimulating arguments. I think limnology and oceanography are, conceptually, much less similar than lakes and oceans.

]]>This is going to be a pretty niche topic, but probably useful for someone out there. Lately I’ve been working with a lot of geospatial data for the West Antarctic Peninsula. One of the things that I needed to do was krige the data (kriging is a form of 2D interpolation; I’m using the pracma library for this). Kriging is a problem near coastlines because it assumes a contiguous space to work in. If there happens to be an island or other landmass in the way there is no way to represent the resulting discontinuity in whatever parameter you’re looking at. Because of this I needed to find a way to mask the output. This doesn’t really solve the problem, but at least it allows me to identify areas of concern (for example interpolation that extends across an isthmus, if there are sample points only on one side).

I’m kriging and building maps entirely inside R, which has somewhat immature packages for dealing with geospatial data. The easiest masking solution would be to use filled polygons from any polygon format shapefile that accurately represents the coastline. Unfortunately I couldn’t find an R package that does this correctly with the shapefiles that I have access to. In addition, because of my downstream analysis it was better to mask the data itself, and not just block out landmasses in the graphical output.

Sharon Stammerjohn at the Institute of Arctic and Alpine Research pointed me to the excellent Bathymetry and Global Relief dataset produced by NOAA. This wasn’t a complete solution to the problem but it got me moving in the right direction. From the custom grid extractor at http://maps.ngdc.noaa.gov/viewers/wcs-client/ I selected an ETOPO1 (bedrock) grid along the WAP, with xyz as the output format. If you’re struggling with shapefiles the xyz format is like a cool drink of water, being a 3-column matrix of longitude, latitude, and height (or depth). For the purpose of creating the mask I considered landmass as any lat-long combination with height > 0.

There is one more really, really big twist to what I was trying to do, however. The Palmer LTER uses a custom 1 km pixel grid instead of latitude-longitude. It’s a little easier to conceptualize than lat-long given the large longitude distortions at high latitude (and the inconvenient regional convergence of lat-long values on similar negative numbers). It is also a little more ecologically relevant, being organized parallel to the coastline instead of north to south. Unfortunately this makes the grid completely incompatible with other Euclidean reference systems such as UTM. So before I could use my xyz file to construct a land mask I needed to convert it to the line-station grid system used by the Palmer LTER. If you’re working in lat-long space you can skip over this part.

Many moons ago someone wrote a Matlab script to convert lat-long to line-station which you can find here. Unfortunately I’m not a Matlab user, nor am I inclined to become one. Fortunately it was pretty straightforward to copy-paste the code into R and fix the syntactic differences between the two languages. Three functions in total are required:

    ## AUTHORS OF ORIGINAL MATLAB SCRIPT:
    # Richard A. Iannuzzi
    # Lamont-Doherty Earth Observatory
    # iannuzzi@ldeo.columbia.edu
    # based on: LTERGRID program written by Kirk Waters (NOAA Coastal Services Center), February 1997

    ## some functions that are used by the main function

    SetStation <- function(e, n, CENTEREAST, CENTERNORTH, ANGLE){
      uu = e - CENTEREAST
      vv = n - CENTERNORTH
      z1 = cos(ANGLE)
      z2 = sin(ANGLE)
      NorthKm = (z1 * uu - z2 * vv) / 1000 + 600
      EastKm = (z2 * uu + z1 * vv) / 1000 + 40
      return(c(NorthKm, EastKm))
    }

    CentralMeridian <- function(iz){
      if(abs(iz) > 30){
        iutz = abs(iz) - 30
        cm = ((iutz * 6.0) - 3.0) * -3600
      } else{
        iutz = 30 - abs(iz)
        cm = ((iutz * 6.0) + 3.0) * +3600
      }
      return(cm)
    }

    GeoToUTM <- function(lat, lon, zone){
      axis = c(6378206.4,6378249.145,6377397.155,
               6378157.5,6378388.,6378135.,6377276.3452,
               6378145.,6378137.,6377563.396,6377304.063,
               6377341.89,6376896.0,6378155.0,6378160.,
               6378245.,6378270.,6378166.,6378150.)
      bxis = c(6356583.8,6356514.86955,6356078.96284,
               6356772.2,6356911.94613,6356750.519915,6356075.4133,
               6356759.769356,6356752.31414,6356256.91,6356103.039,
               6356036.143,6355834.8467,6356773.3205,6356774.719,
               6356863.0188,6356794.343479,6356784.283666,
               6356768.337303)
      ak0 = 0.9996
      radsec = 206264.8062470964
      sphere = 9
      a = axis[sphere - 1] # major axis size
      b = bxis[sphere - 1] # minor axis size
      es = ((1-b^2/a^2)^(1/2))^2 # eccentricity squared
      slat = lat * 3600 # latitude in seconds
      slon = -lon * 3600 # longitude in seconds
      cm = 0 # central meridian in sec
      iutz = 0
      cm = CentralMeridian(zone) # call the function
      phi = slat/radsec
      dlam = -(slon - cm)/radsec
      epri = es/(1.0 - es)
      en = a/sqrt(1.0 - es * sin(phi)^2)
      t = tan(phi)^2
      c = epri * cos(phi)^2
      aa = dlam * cos(phi)
      s2 = sin(2.0 * phi)
      s4 = sin(4.0 * phi)
      s6 = sin(6.0 * phi)
      f1 = (1.0 - (es/4.)-(3.0*es*es/64.)-(5.0*es*es*es/256))
      f2 = ((3*es/8)+(3.0*es*es/32)+(45*es*es*es/1024))
      f3 = ((15*es*es/256)+(45*es*es*es/1024))
      f4 = (35*es*es*es/3072)
      em = a*(f1*phi - f2*s2 + f3*s4 - f4*s6)
      xx = ak0 * en * (aa + (1.-t+c) * aa^3/6 + (5 - 18*t + t*t + 72*c-58*epri)* aa^5/120) + 5e5
      yy = ak0 * (em + en * tan(phi) *((aa*aa/2) + (5-t+9*c+4*c*c)*aa^4/24 + (61-58*t +t*t +600*c - 330*epri)* aa^6/720))
      if(zone < 0 | slat < 0){
        yy = yy + 1e7
      }
      return(c(xx, yy))
    }

    ## This function actually works with your data

    ll2gridLter <- function(inlat, inlon){
      NorthKm = 0 # initialize
      EastKm = 0 # initialize
      zone = -20 # set zone (for LTER region, I think)
      ANGLE = -50 * pi / 180
      CENTEREAST = 433820.404 # eastings for station 600.040
      CENTERNORTH = 2798242.817 # northings for station 600.040
      # take latitude longitude and get station
      x.y = GeoToUTM(inlat, inlon, zone)
      NorthKm.EastKm = SetStation(x.y[1], x.y[2], CENTEREAST, CENTERNORTH, ANGLE)
      return(NorthKm.EastKm)
    }
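A quick sanity check on the rotation and translation step, as a sketch. It repeats SetStation so it runs on its own and uses the same constants that ll2gridLter passes in. The two test points are my own choice: the grid origin, and a point offset 1000 m along the rotated axis.

```r
## Copy of the rotation/translation step so this block is self-contained.
SetStation <- function(e, n, CENTEREAST, CENTERNORTH, ANGLE){
  uu = e - CENTEREAST
  vv = n - CENTERNORTH
  z1 = cos(ANGLE)
  z2 = sin(ANGLE)
  NorthKm = (z1 * uu - z2 * vv) / 1000 + 600
  EastKm = (z2 * uu + z1 * vv) / 1000 + 40
  return(c(NorthKm, EastKm))
}

ANGLE <- -50 * pi / 180
CENTEREAST <- 433820.404
CENTERNORTH <- 2798242.817

## The grid origin should come back as line 600, station 40.
origin <- SetStation(CENTEREAST, CENTERNORTH, CENTEREAST, CENTERNORTH, ANGLE)

## Move 1000 m along the rotated line axis: the line value should
## increase by 1 and the station value should not change.
one.km <- SetStation(CENTEREAST + 1000 * cos(ANGLE),
                     CENTERNORTH - 1000 * sin(ANGLE),
                     CENTEREAST, CENTERNORTH, ANGLE)

print(origin)
print(one.km)
```

This only exercises the grid rotation, not the UTM projection in GeoToUTM, so it is a partial check at best.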

Once the functions are defined I used them to convert the lat/long coordinates in the xyz file to line-station.

    ## Read in xyz file.
    lat.long.depth <- read.table('etopo1_bedrock.xyz', header = F, col.names = c('long', 'lat', 'depth'))

    ## Limit to points at or above sea level.
    lat.long.land <- lat.long.depth[which(lat.long.depth$depth >= 0),]

    ## Create a matrix to hold the output.
    line.station.land <- matrix(ncol = 3, nrow = length(lat.long.land$long))
    colnames(line.station.land) <- c('line', 'station', 'depth')

    ## Execute the ll2gridLter function on each point. Yes, I'm using a loop to do this.
    for(i in 1:length(lat.long.land$long)){
      line.station.land[i,] <- c(ll2gridLter(lat.long.land$lat[i], lat.long.land$long[i]), lat.long.land$depth[i])
      print(paste(c(i, line.station.land[i,])))
    }

    ## Write out the matrix.
    write.csv(line.station.land, 'palmer_grid_landmask.csv', row.names = F, quote = F)
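If the loop bothers you, the same step can probably be done with mapply. This is only a sketch: `toy.convert` is a hypothetical stand-in so the block runs without the three functions above (it is not the real conversion), and the three points are made up.

```r
## Stand-in for ll2gridLter so the example runs on its own. It is NOT the
## real conversion; it just returns two numbers per point.
toy.convert <- function(inlat, inlon){
  c(inlat * 10, inlon * 10)
}

## Toy data in the same long/lat/depth layout as the xyz file.
lat.long.land <- data.frame(long = c(-64.1, -64.0, -63.9),
                            lat = c(-64.8, -64.7, -64.6),
                            depth = c(12, 150, 3))

## mapply returns a 2 x n matrix here, so transpose before binding.
line.station <- t(mapply(toy.convert, lat.long.land$lat, lat.long.land$long))
line.station.land <- cbind(line.station, lat.long.land$depth)
colnames(line.station.land) <- c('line', 'station', 'depth')

print(line.station.land)
```

Swapping `toy.convert` for `ll2gridLter` should give the same matrix as the loop, minus the progress printing.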

At this point I had a nice csv file with line, station, and elevation. I was able to read this into my existing kriging script and convert it into a mask.

    ## Read in csv file.
    landmask <- read.csv('palmer_grid_landmask.csv')

    ## Limit to the lines and stations that I'm interested in.
    landmask <- landmask[which(landmask[,1] <= 825 & landmask[,1] >= -125),]
    landmask <- landmask[which(landmask[,2] <= 285 & landmask[,2] >= -25),]

    ## Interpolated data is at 1 km resolution, need to round off
    ## to same resolution here.
    landmask.expand <- cbind(ceiling(landmask[,1]), ceiling(landmask[,2]))

    ## Unfortunately this doesn't adequately mask the land. Need to expand the size of each
    ## masked pixel 1 km in each direction.
    landmask.expand <- rbind(landmask.expand, cbind(floor(landmask[,1]), floor(landmask[,2])))
    landmask.expand <- rbind(landmask.expand, cbind(ceiling(landmask[,1]), floor(landmask[,2])))
    landmask.expand <- rbind(landmask.expand, cbind(floor(landmask[,1]), ceiling(landmask[,2])))
    landmask.expand <- unique(landmask.expand)
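To see what the floor/ceiling expansion does, here is the same logic wrapped in a small function and applied to a single made-up point. The name `expand.mask` is hypothetical; the body mirrors the four rounding combinations used above.

```r
## Expand fractional grid coordinates to every surrounding integer cell.
## coords: two-column matrix of (line, station).
expand.mask <- function(coords){
  out <- rbind(cbind(ceiling(coords[,1]), ceiling(coords[,2])),
               cbind(floor(coords[,1]), floor(coords[,2])),
               cbind(ceiling(coords[,1]), floor(coords[,2])),
               cbind(floor(coords[,1]), ceiling(coords[,2])))
  unique(out)
}

## One made-up land point that falls between grid nodes.
toy.point <- matrix(c(100.3, 20.7), ncol = 2)
toy.cells <- expand.mask(toy.point)
print(toy.cells)
```

The point at (100.3, 20.7) becomes the four cells (101, 21), (100, 20), (101, 20), and (100, 21). A point that already sits exactly on a grid node collapses to a single cell after `unique`.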

I’m not going to cover how I did the kriging in this post. My kriged data is in a matrix called *temp.pred.matrix* with colnames given by ‘x’ followed by ‘station’, as in x20 for station 20, and row names ‘y’ followed by ‘line’, as in y100 for line 100. To convert interpolated points that are actually land to NA values I simply added this line to my code:

    temp.pred.matrix[cbind(paste0('y', landmask.expand[,1]), paste0('x', landmask.expand[,2] * -1))] <- NA
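The trick in that line is indexing a matrix with a two-column character matrix of row and column names. A toy version, with made-up values and without the sign flip that is specific to my dataset, looks like this. The names `pred`, `land`, and `idx` are just for illustration.

```r
## Toy interpolated matrix: rows are lines ('y...'), columns are stations ('x...').
pred <- matrix(1:6, nrow = 2,
               dimnames = list(c('y100', 'y101'), c('x20', 'x21', 'x22')))

## Made-up land cells as (line, station); the last one is off the grid.
land <- rbind(c(100, 21), c(101, 22), c(300, 20))

## Build (row name, column name) pairs.
idx <- cbind(paste0('y', land[,1]), paste0('x', land[,2]))

## Keep only cells that exist in the matrix; unmatched names would
## otherwise raise a "subscript out of bounds" error.
idx <- idx[idx[,1] %in% rownames(pred) & idx[,2] %in% colnames(pred), , drop = FALSE]

pred[idx] <- NA
print(pred)
```

Filtering on the dimnames first is a defensive step I would suggest whenever the mask might extend beyond the interpolated grid.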

Here’s what the kriged silicate data looks like after masking.

Excellent. The masked area corresponds with known landmasses; that’s Adelaide Island (home of Rothera Station) coming in at the eastern portion of Line 300, and various islets and the western edge of the Antarctic Peninsula to the northeast. At this point erroneous data has been eliminated from the matrix. Annual inventories of silicate and other nutrients can be more accurately calculated from the data, and our eyes are not drawn to interesting features in the interpolation that have no chance of reflecting reality because they are over land. The white doesn’t look that appealing to me in this plot, however, so I masked the land with black by adding points to the plot. Again, I’m not going to show the whole plotting routine because some variables would require a lengthy explanation about how my larger dataset is structured. The plot was created using imagep in the oce package. One thing to be aware of is that this command automatically transposes the matrix, thus the mask needs to be transposed as well. I did this manually in the following line:

    ## Show masked points as black squares.
    points(landmask.expand[,1] ~ {-1 * landmask.expand[,2]}, pch = 15, cex = 0.6)

And the final plot:

]]>

The downside of course, is that the primer ran the risk of being outdated before it even went into review. This was mitigated somewhat by the review process itself, and authors did have a chance to update their various sections. Some sections are more stable than others; the section that I wrote with Shawn McGlynn (now at the Tokyo Institute of Technology) on how life uses energy for example, covers some fairly fundamental ground and is likely to stand the test of time. Less so for sections that cover planetary processes in and outside of our solar system; paradigms are being broken in planetary science as fast as they form!

The Astrobiology Primer is a very valuable document because it takes a complicated and interdisciplinary field of study and attempts to summarize it for a broad audience. Most of the Primer should be accessible to anyone with a basic grasp of science. I wonder if it could even serve as a model for other disciplines. What if the junior scientists in every discipline (perhaps roughly defined by individual NSF or NASA programs) got together once every five years to write an open-access summary of the major findings in their field? This might provide a rich and colorful counterpoint to the valuable but often [dry, esoteric, top-down? There’s an adjective that I’m searching for here but it escapes me] reports produced by the National Academies.

The co-lead editors were critical to the success of the Primer v2. I haven’t explicitly asked Shawn and Kaitlin if they were compensated in any way for this activity – perhaps they rolled some of this work into various fellowships and such over the years. More likely this was one more extracurricular activity carried out on the side. Such is the way science works, and the lines are sufficiently blurred between curricular and extracurricular that most of us don’t even look for them anymore. In recognition of this, and to speed the publication and heighten the quality of a future Primer v3, it would be nice to see NASA produce a specific funding call for a (small!) editorial team. Three years of partial salary and funding for a series of writing workshops would make a huge difference in scope and quality.

]]>