Computer tutorials – The Bowman Lab

Local installation of DeepTMHMM

Jeff — Wed, 03 Dec 2025 20:38:49 +0000

For one of our ongoing metagenomic projects I needed to split predicted proteins into cytoplasmic and transmembrane groups. After looking at a couple of different options I opted to use DeepTMHMM, part of the biolib framework. DeepTMHMM has a wonderful wrapper that submits jobs to a remote server for analysis. This is great for relatively small numbers of sequences. However, even after clustering by similarity I have well over 10⁵ predicted proteins. The wrapper would be terribly inefficient for this many sequences, and while I couldn’t find any explicit limits on the number of submissions, I presume that it would be considered rude to bomb the server with thousands of fasta files. Thus began my odyssey to run DeepTMHMM locally, and Homeric the journey was. Unfortunately, there was no way to document the process step by step, as it involved serious and increasingly frantic troubleshooting of both hardware and software. Instead I present some cautionary notes on things that didn’t work and the solution that ultimately did.

DeepTMHMM makes use of a GPU via PyTorch. None of our servers were equipped with a GPU so step 1 was to acquire one. I have a pretty good, though elderly, Linux box in my office for development work. Some discussion with the manufacturer (Puget Systems) assured me that the motherboard would be compatible with a relatively modern GPU. I selected an NVIDIA GeForce RTX 4060 as a good compromise between cost and performance and a couple days later was ready for surgery. Here’s where some initial mistakes were made. I admit them here so that others may find humor.

Mistake 1 – I borrowed some power cables for the GPU not appreciating that these are specific to the power supply unit. The resulting (mildly traumatic) incident somehow marred the boot sector of the boot drive, though the drive itself was fine. While recovering from that I made Mistake 2:

Mistake 2 – Updating the operating system. The development workstation was (is) running Ubuntu LTS 20.04, which is nearing end of support. Since I was replacing the boot drive anyway it seemed like a good idea to modernize things to LTS 24.04.

After a bit of work I had everything up and running (with original power cables which I thankfully had been toting around for 10 years) and the nvidia-smi command showed the GPU alive and talking to the system. Time to install DeepTMHMM.

The “preferred” way of running DeepTMHMM locally is via a docker container. Let the record show that I do not recommend this option. I tried this on the workstation and via WSL on my GPU-equipped laptop. Buried deep deep deep in the stack are some dependency conflicts that prevent a modern implementation of docker from reading the output of the code running inside the outdated container. After a few days of troubleshooting I abandoned this in favor of a local install of DeepTMHMM.

You can obtain an academic license and copy of the software by emailing licensing@biolib.com (thank you ChatGPT for this solution… I don’t think I would have found it otherwise). On the surface it looks straightforward enough. There’s a helpful README file with install instructions and a reasonable number of dependencies. The trick is with the dependency versions.

Following standard best practice I isolated everything in a conda environment using a Python version that matched the docker container (3.8.20). I felt my way through the rest of the dependencies and it all looked good, but regardless of what I tried the specified version of PyTorch couldn’t run code on the GPU. I can’t recall all or even most of the things I tried, but at some point I reached the end of the line and decided to recreate a historically accurate place for DeepTMHMM to call home. This meant reinstalling LTS 20.04 and an older NVIDIA driver (570.133.07). I did need a more recent version of CUDA (12.8) than was bundled with PyTorch, and PyTorch 2.0.1 instead of the one indicated in the README. Magically, and most unexpectedly at this point, everything worked and DeepTMHMM is happily chewing on my data. Should take about 90 hours to run 200,000 or so predictions.

Here’s the final recipe that worked and a modified version of the original README:
OS: Ubuntu LTS 20.04.6
GPU: NVIDIA GeForce RTX 4060
CUDA: v12.8
NVIDIA driver: 570.133.07

### DeepTMHMM 1.0 - Academic Version ###

### Installation ###

# Install system-wide dependencies

sudo apt-get install libhdf5-dev

# Setup a virtual environment

conda create -n deeptmhmm python=3.8
conda activate deeptmhmm

# Install build dependencies (inside environment but not with conda)
python3 -m pip install wheel Cython==0.29.37 pkgconfig==1.5.5

# Install PyTorch
pip install torch==2.0.1

# Install other dependencies (inside environment but not with conda)
python3 -m pip install -r requirements.txt

# Run tool on sample file
python3 predict.py --fasta sample.fasta --output-dir result1

# The result is now available in result1/
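
With a run this long it can help to feed predict.py the data in batches, so that an interrupted job only costs one batch. The following is a minimal sketch of that idea, not part of the DeepTMHMM README. The function names, batch size, and file names are assumptions for illustration; the only DeepTMHMM details used are the --fasta and --output-dir flags shown above.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    yield header, ''.join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def split_fasta(path, out_dir, batch_size=1000):
    """Write batch_size records per file and return the list of batch paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    batch_paths, handle = [], None
    for n, (header, seq) in enumerate(read_fasta(path)):
        if n % batch_size == 0:
            # start a new batch file every batch_size records
            if handle is not None:
                handle.close()
            batch_path = out_dir / f'batch_{n // batch_size:05d}.fasta'
            handle = open(batch_path, 'w')
            batch_paths.append(batch_path)
        handle.write(f'>{header}\n{seq}\n')
    if handle is not None:
        handle.close()
    return batch_paths


# Hypothetical usage (file and directory names are placeholders):
# import subprocess
# for batch in split_fasta('proteins.fasta', 'batches', batch_size=1000):
#     subprocess.run(['python3', 'predict.py', '--fasta', str(batch),
#                     '--output-dir', 'result_' + batch.stem], check=True)
```

Each batch gets its own output directory, so a restart can skip any batch whose directory already exists.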

The output of nvidia-smi for good measure. Note the GPU memory allocated to python3 for DeepTMHMM:

Alignment and phylogenetic inference with hmmalign and RAxML-ng

Jeff — Tue, 31 May 2022 04:13:03 +0000

RAxML is one of the most popular programs around for phylogenetic inference via maximum likelihood. Similarly, hmmalign within HMMER 3 is a popular way to align amino acid sequences against HMMs from Pfam or created de novo. Combine the two and you have an excellent method for constructing phylogenetic trees. But gluing the two together isn’t exactly seamless and novice users might be deterred by a couple of unexpected hurdles. Recently, I helped a student develop a workflow which I’m posting here.

First, define some variables just to make the bash commands a bit cleaner. REF refers to the name of the Pfam hmm that we’re aligning against (Bac_rhodopsin.hmm in this case), while QUERY is the sequence file to be aligned (hop and bop gene products, plus a dinoflagellate rhodopsin as outgroup).

REF=Bac_rhodopsin
QUERY=uniprot_hop_bop_reviewed

Now, align and convert the alignment to fasta format (required by RAxML-ng).

hmmalign --amino -o $QUERY.sto $REF.hmm $QUERY.fasta
seqmagick convert $QUERY.sto $QUERY.align.fasta

Test which model is best for these data. Here we get LG+G4+F.

modeltest-ng -i $QUERY.align.fasta -d aa -p 8

Check your alignment!

raxml-ng --check --msa $QUERY.align.fasta --model LG+G4+F --prefix $QUERY

Oooh… I bet it failed. Exciting! In this case (using sequences from Uniprot) the long sequence descriptions are incompatible with RAxML-ng. Let’s do a little Python to clean that up.

from Bio import SeqIO

with open('uniprot_hop_bop_reviewed.align.clean.fasta', 'w') as clean_fasta:
    for record in SeqIO.parse('uniprot_hop_bop_reviewed.align.fasta', 'fasta'):
        record.description = ''
        SeqIO.write(record, clean_fasta, 'fasta')
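
Once the descriptions are gone, each record is identified by its ID alone, so it is worth confirming that the IDs are still unique and that every aligned sequence has the same length. The sketch below is a rough pre-check in plain Python and is not a substitute for raxml-ng --check. The function name check_alignment is an assumption for illustration.

```python
from collections import Counter


def check_alignment(path):
    """Return a list of problems found in an aligned FASTA file."""
    records = []  # one [id, aligned length] entry per sequence
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                # the ID is everything up to the first whitespace
                fields = line[1:].split()
                records.append([fields[0] if fields else '', 0])
            elif line and records:
                records[-1][1] += len(line)

    problems = []
    counts = Counter(record[0] for record in records)
    duplicates = sorted(seq_id for seq_id, n in counts.items() if n > 1)
    if duplicates:
        problems.append('duplicate IDs: ' + ', '.join(duplicates))
    lengths = sorted({record[1] for record in records})
    if len(lengths) > 1:
        problems.append('unequal sequence lengths: ' + ', '.join(map(str, lengths)))
    return problems


# Hypothetical usage:
# print(check_alignment('uniprot_hop_bop_reviewed.align.clean.fasta'))
```

An empty list means neither problem was found.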

Check again…

raxml-ng --check --msa $QUERY.align.clean.fasta --model LG+G4+F --prefix $QUERY

If everything is kosher go ahead and fire up your phylogenetic inference. Here I’ve limited bootstrapping to 100 trees. If you have the time/resources do more.

raxml-ng --all --msa $QUERY.align.clean.fasta --model LG+G4+F --prefix $QUERY --bs-trees 100

Superimpose the bootstrap support values on the best ML tree.

raxml-ng --support --tree $QUERY.raxml.bestTree --bs-trees $QUERY.raxml.bootstraps

And here’s our creation as rendered by Archaeopteryx. Some day I’ll create a tree that is visually appealing, but today is not that day. But you get the point.

A short tutorial on Gnu Parallel

Luke Piszkin — Wed, 20 Jan 2021 05:36:12 +0000

This post comes from Luke Piszkin, an undergraduate researcher in the Bowman Lab. Gnu Parallel is a must-have utility for anyone that spends a lot of time in Linux Land, and Luke recently had to gain some Gnu Parallel fluency for his project. Enjoy!

*******

GNU parallel is a Linux shell tool for executing jobs in parallel using multiple CPU cores. This is a quick tutorial for increasing your workflow and getting the most out of your machine with parallel. You can find the current distribution here: https://www.gnu.org/software/parallel/. Please try some basic commands to make sure it is working. You will need some basic understanding of “piping” in the command line. I will describe command pipes briefly just for our purposes, but for a more detailed look please see https://www.howtogeek.com/438882/how-to-use-pipes-on-linux/. Piping data in the command line involves taking the output of one command and using it as the input for another. A basic example looks like this:

command_1 | command_2 | command_3 | …

Here the output of command_1 is used as the input of command_2, the output of command_2 is used as the input of command_3, and so on. For now, we will only need to use one pipe with parallel. Now let’s look at a basic command run in parallel.

Input: find -type f -name "*.txt" | parallel cat

Output: 
The house stood on a slight rise just on the edge of the village.
It stood on its own and looked over a broad spread of West Country farmland.
Not a remarkable house by any means - it was about thirty years old, squattist, squarish, made of brick, and had four windows set in the front size and proportion which more or less exactly failed to please the eye
The only person for whom the house was in any way special was Arthur Dent, and that was only because it happened to be the one he lived in.
He had lived in it for about three years, ever since he had moved out of London because it made him nervous and irritable

This command makes use of find to list all the .txt files in my directory, then runs cat on them in parallel, which shows the contents of each file on a new line. We can already see how this is much easier than running each command separately, i.e.:

In: cat file1.txt

The house stood on a slight rise just on the edge of the village.

In: cat file2.txt

It stood on its own and looked over a broad spread of West Country farmland.

Also, notice how we do not need any placeholder for the files in the second command, because of the pipes. Now let’s take a more complicated example:

find -type f -name "*beta_gal_vibrio_vulnificus_1_100000_0__H_flex=up_*.txt" ! -name "*tally*" | parallel -j 4 python3 PEPCplots.py {} flex log

0.001759374417007663, 0.00033497120199255527, 0.9969940359705531
0.0019773468515624356, 0.00022978867370935437, 0.9969940359705531
0.001332602651915014, 0.0005953339816183529, 0.9969940359705531
0.0015118302435556904, 0.0005040931537659636, 0.9969940359705531
0.001320879258211107, 0.0006907926578169569, 0.9969940359705531
0.0016753759966792244, 0.00041583739269117386, 0.9969940359705302
0.0017187095827331082, 0.00036931151058880094, 0.9969940359705531
0.0017045099726521733, 0.00031386214441070197, 0.9969940359705531
0.001399703145023273, 0.0005196629341168314, 0.9969940359705531
0.001436129272321403, 0.0004806654291442482, 0.9969940359705531

This is an example from my research. It takes in a .txt data file and spits out some parameters that I want to put in a spreadsheet. Like before, we use find to get a list of all the files we want the second command to process. We use ! -name "*tally*" to exclude any files that have “tally” anywhere in the name because we don’t want to process those. In the second command, we have the option -j 4. This tells parallel to run up to 4 jobs at a time, one per CPU core. You can check your computer specs to see how many cores you have available. If your machine has hyper-threading, then it can create virtual cores to run jobs on too. For instance, my dinky laptop only has 2 cores, but with hyper-threading I can use 4. This is another way to improve your efficiency. In the second command you also see a {} placeholder. This spot is filled by whatever the first command outputs. In this case, we need that placeholder because our input files go between other arguments. You can also use parallel to run a number of identical commands at the same time. This is helpful if you have a program to run on the same file multiple times. For example:

seq 10 | parallel -N0 cat file1.txt

The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.

Here we use seq as a counting mechanism for how many times to run the second command. You can adjust the number of jobs by changing the seq argument. We include the -N0 flag, which tells parallel to ignore any piped inputs because we aren’t using the first command for inputs this time. Often, I like to include both the time shell tool and the --progress parallel option to see current job status and time for completion:

seq 10 | time parallel --progress -N0 cat file1.txt

Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:4/0/100%/0.0s The house stood on a slight rise just on the edge of the village.
local:4/1/100%/1.0s The house stood on a slight rise just on the edge of the village.
local:4/2/100%/0.5s The house stood on a slight rise just on the edge of the village.
local:4/3/100%/0.3s The house stood on a slight rise just on the edge of the village.
local:4/4/100%/0.2s The house stood on a slight rise just on the edge of the village.
local:4/5/100%/0.2s The house stood on a slight rise just on the edge of the village.
local:4/6/100%/0.2s The house stood on a slight rise just on the edge of the village.
local:3/7/100%/0.1s The house stood on a slight rise just on the edge of the village.
local:2/8/100%/0.1s The house stood on a slight rise just on the edge of the village.
local:1/9/100%/0.1s The house stood on a slight rise just on the edge of the village.
local:0/10/100%/0.1s
0.21user 0.46system 0:00.63elapsed 108%CPU (0avgtext+0avgdata 15636maxresident)k
0inputs+0outputs (0major+12089minor)pagefaults 0swaps
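
For readers who would rather stay inside a Python script, the -j 4 pattern can be approximated with the standard library. This is only a sketch of the same idea and not how GNU parallel works internally. The names run_job and run_parallel are assumptions for illustration, and the command is a stand-in Python one-liner so that the example runs anywhere.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(arg):
    """Run one command with arg in the {} slot and return its standard out."""
    # Stand-in command; a real job might be
    # ['python3', 'PEPCplots.py', arg, 'flex', 'log']
    cmd = [sys.executable, '-c', 'import sys; print(sys.argv[1].upper())', arg]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout.strip()


def run_parallel(args, jobs=4):
    """Run run_job on every item of args with at most `jobs` running at once."""
    # Threads are enough here because the real work happens in child processes
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_job, args))


# os.cpu_count() reports logical cores, so hyper-threaded cores are included
print('logical cores:', os.cpu_count())

results = run_parallel(['file1.txt', 'file2.txt', 'file3.txt'], jobs=4)
print(results)
```

Note that pool.map returns results in input order, whatever order the jobs finish in.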

And with that, you are well on your way to significantly increasing your computing throughput and using the full potential of your machine. You should now have a sufficient understanding of parallel to construct a command for your own projects, and to explore more complicated applications of parallelization. (Bonus points to whoever knows the book that I used for the text files.)

Finding those lost data files

Jeff — Wed, 16 Sep 2020 15:25:13 +0000

It’s been a long time since I’ve had the bandwidth to write up a code snippet here. This morning I had not quite enough time between Zoom meetings to tackle something more involved, so here goes!

In this case I needed to find ~200 sequence (fasta) files for a student in my lab. They were split across several sequencing runs, and for various logistical reasons it was getting a bit tedious to find the location of each sequence file. To solve the problem I wrote a short Python script to wrap the Linux locate command and copy all the files to a new directory where they could be exported.

First, I created a text file “files2find.txt” with text uniquely matching each file that I needed to find. One of the great things about locate is that it doesn’t need to match the full file name.

head files2find.txt
151117_PAL_Sterivex_1
151126_PAL_Sterivex_2
151202_PAL_Sterivex_3
151213_PAL_Sterivex_4
151225_PAL_Sterivex_5
151230_PAL_Sterivex_6
160106_PAL_Sterivex_7
160118_PAL_Sterivex_9
160120_PAL_Sterivex_10
160128_PAL_Sterivex_11

Then the wrapper:

import subprocess
import shutil

with open('files2find.txt') as file_in:
    for line in file_in:
        line = line.rstrip()

        ## Here we use the subprocess module to run the locate command, capturing
        ## standard out.

        temp = subprocess.Popen('locate ' + line,
                                shell = True,
                                executable = '/bin/bash',
                                stdout = subprocess.PIPE)

        ## The communicate method for object temp returns a tuple.  First object
        ## in the tuple is standard out.       
        
        locations = temp.communicate()[0]
        locations = locations.decode().split('\n')

        ## Thank you internet for this one-liner, Python one-liners always throw
        ## me for a loop (no pun intended). Here we search all items in the locations
        ## list for a specific suffix that identifies files that we actually want.
        ## In this case our final analysis files contain "exp.fasta".  Of course if
        ## you're certain of the full file name you could just use locate on that and
        ## omit this step.

        fastas = [i for i in locations if 'exp.fasta' in i] 
        
        path = '/path/to/where/you/want/files/'
        
        found = set()

        ## Use the shutil library to copy found files to a new directory "path".
        ## Copied files are added to the set "found" to avoid being copied more than
        ## once, if they exist in multiple locations on your computer.
        
        for fasta in fastas:
            file_name = fasta.split('/')[-1]
            if file_name not in found:
                shutil.copyfile(fasta, path + file_name) 
                found.add(file_name)

        ## In the event that no files are found report that here.
                
        if len(fastas) == 0:
            print(line, 'not found')
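
The locate command depends on a prebuilt index, so it may not be available on every system and may miss very recent files. Where that is a problem, a slower alternative is to walk the directory tree directly. The following is a minimal sketch of that approach using only the standard library. The function name find_and_copy and its arguments are assumptions for illustration.

```python
import shutil
from pathlib import Path


def find_and_copy(patterns, search_root, dest, suffix='exp.fasta'):
    """Copy files whose path contains a pattern; return patterns with no match."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    # Walk the tree once, keeping only files whose name contains the suffix
    candidates = [p for p in Path(search_root).rglob('*')
                  if p.is_file() and suffix in p.name]

    copied, missing = set(), []
    for pattern in patterns:
        # Like locate, match the pattern anywhere in the full path
        hits = [p for p in candidates if pattern in str(p)]
        if not hits:
            missing.append(pattern)
        for hit in hits:
            # Copy each file name only once, even if it exists in several places
            if hit.name not in copied:
                shutil.copyfile(hit, dest / hit.name)
                copied.add(hit.name)
    return missing


# Hypothetical usage:
# with open('files2find.txt') as file_in:
#     patterns = [line.strip() for line in file_in if line.strip()]
# print(find_and_copy(patterns, '/path/to/search', '/path/to/where/you/want/files/'))
```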

Tutorial: SuperSOMS and an R script for detecting regions of interest

Jeff — Sun, 17 May 2020 03:13:31 +0000

A common exercise in environmental microbiology is counting bacterial cells with an epifluorescent microscope. During my PhD I spent many hours hunched over a microscope in a darkened room, contemplating which points of light were bacteria (and should thus be counted) and which were not. With a large cup of coffee from Seattle’s U District Bean and Bagel and an episode of “This American Life” playing in the background it’s not a bad way to spend the day. But it definitely felt like a procedure that needed some technological enhancement. My biggest concerns were always objectivity and reproducibility; it’s really hard to determine consistently which “regions of interest” (ROIs) to count. There are of course commercial software packages for identifying and counting ROIs in a microscope image. But why pay big money for a software subscription when you can write your own script? I had some free time during our slow transit to Polarstern during MOSAiC Leg 3 and thought I’d give it a try. The following tutorial borrows heavily from the image segmentation example in Wehrens and Kruisselbrink, 2018.

We start with a png image from a camera attached to a microscope. The green features are bacteria and phytoplankton that have been stained with Sybr Green. These are the ROIs that we’d like to identify and count. The image quality is actually pretty bad here; this was made with the epifluorescent scope at Palmer Station back in 2015, and the scope and camera needed some TLC. It turns out that I don’t actually have many such images on my laptop, and I can’t simply go and make a new one because we aren’t allowed in the lab right now! Nonetheless the quality is sufficient for our purposes.

First, we need to convert the image into an RGB matrix that we can work with. I’m sure there’s a way to do this in R, but I couldn’t find an expedient method. Python made it easy.

## convert image to two matrices: a 3 column RGB matrix and
## 2 column xy matrix

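import os
import matplotlib.image as mpimg

name = '15170245.png'

## str.rstrip('.png') strips any trailing '.', 'p', 'n', or 'g' characters rather
## than the suffix, so use splitext to drop the extension safely
name = os.path.splitext(name)[0]

img = mpimg.imread(name + '.png')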

with open(name + '.rgb4r.csv', 'w') as rgb_out, open(name + '.xy.csv', 'w') as xy_out:
    for i in range(0, img.shape[1]):
        for j in range(0, img.shape[0]):
            print(img[j, i, 0], img[j, i, 1], img[j, i, 2], sep = ',', file = rgb_out)
            print(i + 1, j + 1, sep = ',', file = xy_out)

Next we break out R to do the actual analysis (which yes, could be done in Python…). The basic strategy is to use a self organizing map (SOM) with 2 layers. One layer will be color, the other will be position. We’ll use this information to identify distinct classes corresponding to diagnostic features of ROIs. Last, we’ll iterate across all the pixels that appear to belong to ROIs and attempt to draw an ellipse around each group of pixels that makes up a ROI. First, we read in the data produced by the Python script:

scope.rgb <- read.csv('15170245.rgb4r.csv', header = F)
scope.xy <- read.csv('15170245.xy.csv', header = F)

colnames(scope.rgb) <- c('red', 'green', 'blue')
colnames(scope.xy) <- c('X', 'Y')

Then we define a function to render the image described by these matrices:

plotImage <- function(scope.xy, scope.rgb){
  image.col <- rgb(scope.rgb[,"red"], scope.rgb[,"green"], scope.rgb[,"blue"], maxColorValue = 1)
  x.dim <- max(scope.xy$X)
  y.dim <- max(scope.xy$Y)
  
  temp.image <- 1:dim(scope.rgb)[1]
  dim(temp.image) <- c(y.dim, x.dim)
  image(0:x.dim,
        0:y.dim,
        t(temp.image),
        col = image.col,
        ylab = paste0('Y:', y.dim),
        xlab = paste0('X:', x.dim))
}

## give it a test

plotImage(scope.xy, scope.rgb)

You'll note that the function flips the image. While annoying, this doesn't matter at all for identifying ROIs. If it bothers you go ahead and tweak the function :). Now we need to train our SOM. The SOM is what does the heavy lifting of identifying different types of pixels in the image.

#### train som ####

library(kohonen)

som.grid <- somgrid(10, 10, 'hexagonal')
som.model <- supersom(list('rgb' = data.matrix(scope.rgb), 'coords' = data.matrix(scope.xy)), whatmap = c('rgb', 'coords'), user.weights = c(1, 9), grid = som.grid)

Now partition the SOM into k classes with k-means clustering. The value for k has to be determined experimentally but should be consistent for all the images in a set (i.e. a given type of microscopy image).

som.codes <- som.model$codes[[1]]
som.codes.k <- kmeans(som.codes, centers = 6)

## Get mean colors for clusters (classes)

class.col <- c()

for(k in 1:max(som.codes.k$cluster)){
  temp.codes <- som.codes[which(som.codes.k$cluster == k),]
  temp.col <- colMeans(temp.codes)
  temp.col <- rgb(temp.col[1], temp.col[2], temp.col[3], maxColorValue = 1)
  class.col <- append(class.col, temp.col)
}

## Make a plot of the som with the map units colored according to mean color
## of owning class.

plot(som.model,
     type = 'mapping',
     bg = class.col[som.codes.k$cluster],
     keepMargins = F,
     col = NA)

text(som.model$grid$pts, labels = som.codes.k$cluster)
add.cluster.boundaries(som.model, som.codes.k$cluster)

Here's where we have to be a bit subjective. We need to make an informed decision about which classes constitute ROIs. Looking at this map I'm gonna guess 3 and 6. The classes and structure of your map will of course be totally different, even if you start with the same training image. To make use of this information we first predict which classes our original pixels belong to.

## predict for RGB only

image.predict <- predict(som.model, newdata = list('rgb' = data.matrix(scope.rgb)), whatmap = 'rgb')

Then we identify those pixels that belong to the classes we think describe ROIs.

## select units that correspond to ROIs

target.units = which(som.codes.k$cluster %in% c(3, 6))
target.pixels <- scope.xy[which(image.predict$unit.classif %in% target.units), c('X', 'Y')]

Now comes the tricky bit. Knowing which pixels belong to ROIs isn't actually that useful, as each ROI is composed of many different pixels. So we need to aggregate the pixels into ROIs. Again, this requires a little experimentation, but once you figure it out for a given sample type it should work consistently. The key parameter here is "resolution" which we define as how far apart two pixels of the same class need to be to constitute different ROIs. The value 20 seems to work reasonably well for this image.

## loop through all pixels.  if there's a pixel within n distance of it, check to
## see if that pixel belongs to an ROI.  If so, add the new pixel to that area.  If not,
## create a new ROI.  Currently a pixel can be "grabbed" by an ROI produced later.

findROI <- function(resolution = 20){
  roi <- 1
  roi.pixels <- as.data.frame(matrix(nrow = dim(target.pixels)[1], ncol = 3))
  colnames(roi.pixels) <- c('X', 'Y', 'roi')
  
  for(i in 1:dim(target.pixels)[1]){
    
    if(is.na(roi.pixels[i, 'roi']) == T){
      pixel.x <- target.pixels[i, 'X']
      pixel.y <- target.pixels[i, 'Y']
      nns <- which(abs(target.pixels[, 'X'] - pixel.x) < resolution & abs(target.pixels[, 'Y'] - pixel.y) < resolution)
      
      roi.pixels[nns, c('X', 'Y')] <- target.pixels[nns, c('X', 'Y')]
      roi.pixels[nns, 'roi'] <- roi
      roi <- roi + 1
    }
  }
  return(roi.pixels)
}
  
roi.pixels <- findROI()
roi.table <- table(roi.pixels$roi)
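
For anyone following along in Python, the same greedy grouping can be sketched in a few lines. This is a rough translation of the logic in findROI rather than a drop-in replacement. The function name find_roi and the toy pixel list are assumptions for illustration.

```python
def find_roi(pixels, resolution=20):
    """Assign each (x, y) pixel an ROI label and return the list of labels."""
    labels = [None] * len(pixels)
    roi = 1
    for i, (px, py) in enumerate(pixels):
        # Only pixels that do not yet belong to an ROI can seed a new one
        if labels[i] is None:
            for j, (qx, qy) in enumerate(pixels):
                # Everything inside the box around the seed pixel joins this
                # ROI, including pixels already claimed by an earlier ROI
                if abs(qx - px) < resolution and abs(qy - py) < resolution:
                    labels[j] = roi
            roi += 1
    return labels


# Two well separated clumps of pixels
pixels = [(1, 1), (2, 2), (100, 100), (101, 100), (3, 1)]
labels = find_roi(pixels)
print(labels)
```

As in the R version, run time grows with the square of the number of target pixels, so large images with many ROI pixels will be slow.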

To evaluate our discovery of ROIs we plot an ellipse around each ROI in the original image.

## approximate each roi as an ellipse.  need x, y, a, b

plotROI <- function(roi.pixels){
  require(plotrix)
  
  for(roi in unique(roi.pixels$roi)){
    temp.pixels <- roi.pixels[which(roi.pixels$roi == roi),]
    temp.a <- max(temp.pixels$X) - min(temp.pixels$X)
    temp.b <- max(temp.pixels$Y) - min(temp.pixels$Y)
    temp.x <- mean(temp.pixels$X)
    temp.y <- mean(temp.pixels$Y)
    
    plot.y <- temp.y
    draw.ellipse(temp.x, plot.y, temp.a, temp.b, border = 'red')
  }
}

plotImage(scope.xy, scope.rgb)
plotROI(roi.pixels)

It certainly isn't perfect, the two chained diatoms in particular throw off our count. We did, however, do a reasonable job of finding all the small ROIs that represent the smaller, harder to count cells. So how does the model perform for ROI identification on a new image? Here's a new image acquired with the same exposure settings on the same scope. We use the same Python code to convert it to RGB and XY matrices.

## convert image to two matrices: a 3 column RGB matrix and
## 2 column xy matrix

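import os
import matplotlib.image as mpimg

name = '14000740.png'

## str.rstrip('.png') strips any trailing '.', 'p', 'n', or 'g' characters rather
## than the suffix, so use splitext to drop the extension safely
name = os.path.splitext(name)[0]

img = mpimg.imread(name + '.png')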

with open(name + '.rgb4r.csv', 'w') as rgb_out, open(name + '.xy.csv', 'w') as xy_out:
    for i in range(0, img.shape[1]):
        for j in range(0, img.shape[0]):
            print(img[j, i, 0], img[j, i, 1], img[j, i, 2], sep = ',', file = rgb_out)
            print(i + 1, j + 1, sep = ',', file = xy_out)

Then we predict and plot.

scope.rgb <- read.csv('14000740.rgb4r.csv', header = F)
scope.xy <- read.csv('14000740.xy.csv', header = F)

colnames(scope.rgb) <- c('red', 'green', 'blue')
colnames(scope.xy) <- c('X', 'Y')

plotImage(scope.xy, scope.rgb)
image.predict <- predict(som.model, newdata = list('rgb' = data.matrix(scope.rgb)), whatmap = 'rgb') # predict for rgb only

target.units = which(som.codes.k$cluster %in% c(3,6))
target.pixels <- scope.xy[which(image.predict$unit.classif %in% target.units), c('X', 'Y')]

roi.pixels <- findROI()
roi.table <- table(roi.pixels$roi)

plotImage(scope.xy, scope.rgb)
plotROI(roi.pixels)

Not bad! Again, it isn't perfect, some ROIs are grouped together and some are missed (largely a function of the variable focal plane). These can be fixed by experimenting with the model parameters and resolution. We did accomplish the goal of improving objectivity and reproducibility; our approach isn't always right, but at least it's wrong in a consistent way! Of course an additional advantage is if you had hundreds of such images, perhaps representing multiple randomly selected fields from many images, you could process in a few minutes what would take many hours to count.

At this point I can feel the collective judgement of every environmental microbiologist since van Leeuwenhoek for promoting a method that might reduce the serendipitous discovery that comes with spending hours staring through a microscope. So here's a reminder to spend time getting familiar with your samples under the microscope, regardless of how you identify ROIs!

Tutorial: Self Organizing Maps in R

Jeff — Sun, 10 May 2020 04:48:14 +0000

Self-organizing maps (SOMs) are a form of neural network and a wonderful way to partition complex data. In our lab they’re a routine part of our flow cytometry and sequence analysis workflows, but we use them for all kinds of environmental data (like this). All of the mainstream data analysis languages (R, Python, Matlab) have packages for training and working with SOMs. My favorite is the R package Kohonen, which is simple to use but can support some fairly complex analysis through SOMs with multiple data layers and supervised learning (superSOMs). The Kohonen package has a nice, very accessible paper that describes its key features, and some excellent examples. This tutorial applies our basic workflow for a single-layer SOM to RGB color data. RGB color space segmentation is a popular way to evaluate machine learning algorithms, as it is intrinsically multi-variate and inherently meaningful. Get like colors grouping together and you know that you’ve set things up correctly!

This application of SOMs has two steps. Each of these steps can be thought of as an independent data reduction step. It’s important to remember that you’re not reducing dimensions per se, as you would in a PCA, you’re aggregating like data so that you can describe them as convenient units (instead of n individual observations). The final outcome, however, represents a reduction in dimensionality to a single parameter for all observations (e.g., the color blue instead of (0, 0, 255) in RGB colorspace). The first step – training the SOM – assigns your observations to map units. The second step – clustering the map units into classes – finds the structure underlying the values associated with the map units after training. At the end of this procedure each observation belongs to a map unit, and each map unit belongs to a class. Thus each observation inherits the class of its associated map unit. If that’s not clear don’t sweat it. It will become clear as you go through the procedure.
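
The inheritance described above amounts to two lookups chained together. Here is a toy illustration in Python with made-up assignments; the variable names and values are assumptions for illustration only. In kohonen the first lookup corresponds to the unit.classif element of a trained model.

```python
# Step 1 (training): the map unit that each observation was assigned to
unit_of_observation = [0, 2, 2, 1, 3, 0]

# Step 2 (clustering): the class that each map unit was assigned to
class_of_unit = {0: 'A', 1: 'A', 2: 'B', 3: 'C'}

# Each observation inherits the class of its map unit
class_of_observation = [class_of_unit[unit] for unit in unit_of_observation]
print(class_of_observation)
```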

First, let’s generate some random RGB data. This takes the form of a three column matrix where each row is a pixel (i.e. an observation).

#### generate some RGB data ####

## select the number of random RGB vectors for training data

sample.size <- 10000

## generate dataframe of random RGB vectors

sample.rgb <- as.data.frame(matrix(nrow = sample.size, ncol = 3))
colnames(sample.rgb) <- c('R', 'G', 'B')

sample.rgb$R <- sample(0:255, sample.size, replace = T)
sample.rgb$G <- sample(0:255, sample.size, replace = T)
sample.rgb$B <- sample(0:255, sample.size, replace = T)

Next, we define a map space for the SOM and train the model. Picking the right grid size for the map space is non-trivial; you want about 5 elements from the training data per map unit, though you’ll likely find that they’re not uniformly distributed. It’s best to use a symmetrical map (unless you have a very small training dataset), hexagonal map units, and a toroidal shape. The latter is important to avoid edge effects (a toroid has no edges).

One important caveat for the RGB data is that we’re not going to bother with any scaling or normalizing. The parameters are all on the same scale and evenly distributed between 0 and the max value of 255. Likely your data are not so nicely formed!

#### train the SOM ####

## define a grid for the SOM and train

library(kohonen)

grid.size <- ceiling(sample.size ^ (1/2.5))
som.grid <- somgrid(xdim = grid.size, ydim = grid.size, topo = 'hexagonal', toroidal = T)
som.model <- som(data.matrix(sample.rgb), grid = som.grid)

Once you’ve trained the SOM it’s a good idea to explore the output of the `som` function to get a feel for the different items in there. The output takes the form of nested lists. Here we extract a couple of items that we’ll need later, and also create a distance matrix of the map units. We can do this because the fundamental purpose of map units is to have a codebook vector that mimics the structure of the training data. During training each codebook vector is iteratively updated along with its neighbors to match the training data. After sufficient iterations the codebook vectors reflect the underlying structure of the data.

## extract some data to make it easier to use

som.events <- som.model$codes[[1]]
som.events.colors <- rgb(som.events[,1], som.events[,2], som.events[,3], maxColorValue = 255)
som.dist <- as.matrix(dist(som.events))

Now that we have a trained SOM let’s generate a descriptive plot. Since the data are RGB colors, if we color the plot accordingly it should be sensible. For comparison, we first create a plot with randomized codebook vectors. This represents the SOM at the start of training.

## generate a plot of the untrained data.  this isn't really the configuration at first iteration, but
## serves as an example

plot(som.model,
     type = 'mapping',
     bg = som.events.colors[sample.int(length(som.events.colors), size = length(som.events.colors))],
     keepMargins = F,
     col = NA,
     main = '')

And now the trained SOM:

## generate a plot after training.

plot(som.model,
     type = 'mapping',
     bg = som.events.colors,
     keepMargins = F,
     col = NA,
     main = '')

So pretty! The next step is to cluster the map units into classes. As with all clustering analysis, a key question is how many clusters (k) should we define? One way to inform our decision is to evaluate the distance between all items assigned to each cluster for many different values of k. Ideally, creating a scree plot of mean within-cluster distance vs. k will yield an inflection point that suggests a meaningful value of k. In practice this inflection point is extremely sensitive to the size of the underlying data (in this case, the number of map units); however, it can be a useful starting place. Consider that the RGB data were defined continuously, meaning that there is no underlying structure! Nonetheless we still get an inflection point.

#### look for a reasonable number of clusters ####

## Evaluate within cluster distances for different values of k.  This is
## more dependent on the number of map units in the SOM than the structure
## of the underlying data, but until we have a better way...

## Define a function to calculate mean distance within each cluster.  This
## is roughly analogous to the within clusters ss approach

clusterMeanDist <- function(clusters){
  cluster.means <- c()
  
  for(cl in unique(clusters)){
    temp.members <- which(clusters == cl)
    
    if(length(temp.members) > 1){
      ## mean pairwise distance among the map units in this cluster
      temp.dist <- som.dist[temp.members, temp.members]
      cluster.means <- append(cluster.means, mean(temp.dist))
    }else{
      ## a single-member cluster has zero within-cluster distance
      cluster.means <- append(cluster.means, 0)
    }
  }
  
  return(mean(cluster.means))
  
}

library(vegan) # provides vegdist(), used for the hierarchical clustering below

try.k <- 2:100
cluster.dist.eval <- as.data.frame(matrix(ncol = 3, nrow = (length(try.k))))
colnames(cluster.dist.eval) <- c('k', 'kmeans', 'hclust')

for(i in 1:length(try.k)) {
  cluster.dist.eval[i, 'k'] <- try.k[i]
  cluster.dist.eval[i, 'kmeans'] <- clusterMeanDist(kmeans(som.events, centers = try.k[i], iter.max = 20)$cluster)
  cluster.dist.eval[i, 'hclust'] <- clusterMeanDist(cutree(hclust(vegdist(som.events)), k = try.k[i]))
}

plot(cluster.dist.eval[, 'kmeans'] ~ try.k,
     type = 'l')

lines(cluster.dist.eval[, 'hclust'] ~ try.k,
      col = 'red')

legend('topright',
       legend = c('k-means', 'hierarchical'),
       col = c('black', 'red'),
       lty = c(1, 1))

Having picked a reasonable value for k (let’s say k = 20) we can evaluate different clustering algorithms. For our data k-means almost always performs best, but you should choose what works best for your data. Here we’ll evaluate k-means, hierarchical clustering, and model-based clustering. What we’re looking for in the plots is a clustering method that produces contiguous classes. If classes are spread all across the map, then the clustering algorithm isn’t capturing the structure of the SOM well.

#### evaluate clustering algorithms ####

## Having selected a reasonable value for k, evaluate different clustering algorithms.

library(pmclust)

## Define a function to make a simple plot of clustering output.
## This is the same as the previous plotting, but we define the function
## here as we wanted to play with the color earlier.

plotSOM <- function(clusters){
  plot(som.model,
       type = 'mapping',
       bg = som.events.colors,
       keepMargins = F,
       col = NA)
  
  add.cluster.boundaries(som.model, clusters)
}

## Try several different clustering algorithms, and, if desired, different values for k

cluster.tries <- list()

for(k in c(20)){
  
  ## model based clustering using pmclust
  
  som.cluster.pm.em <- pmclust(som.events, K = k, algorithm = 'em')$class # model based
  som.cluster.pm.aecm <- pmclust(som.events, K = k, algorithm = 'aecm')$class # model based
  som.cluster.pm.apecm <- pmclust(som.events, K = k, algorithm = 'apecm')$class # model based
  som.cluster.pm.apecma <- pmclust(som.events, K = k, algorithm = 'apecma')$class # model based
  som.cluster.pm.kmeans <- pmclust(som.events, K = k, algorithm = 'kmeans')$class # model based
  
  ## k-means clustering
  
  som.cluster.k <- kmeans(som.events, centers = k, iter.max = 100, nstart = 10)$cluster # k-means
  
  ## hierarchical clustering
  
  som.dist.h <- dist(som.events) # hierarchical, step 1 (new name so the som.dist matrix above is not overwritten)
  som.cluster.h <- cutree(hclust(som.dist.h), k = k) # hierarchical, step 2
  
  ## capture outputs
  
  cluster.tries[[paste0('som.cluster.pm.em.', k)]] <- som.cluster.pm.em
  cluster.tries[[paste0('som.cluster.pm.aecm.', k)]] <- som.cluster.pm.aecm
  cluster.tries[[paste0('som.cluster.pm.apecm.', k)]] <- som.cluster.pm.apecm
  cluster.tries[[paste0('som.cluster.pm.apecma.', k)]] <- som.cluster.pm.apecma
  cluster.tries[[paste0('som.cluster.pm.kmeans.', k)]] <- som.cluster.pm.kmeans
  cluster.tries[[paste0('som.cluster.k.', k)]] <- som.cluster.k
  cluster.tries[[paste0('som.cluster.h.', k)]] <- som.cluster.h
}

## Take a look at the various clusters.  You're looking for the algorithm that produces the
## least fragmented clusters.

plotSOM(cluster.tries$som.cluster.pm.em.20)
plotSOM(cluster.tries$som.cluster.pm.aecm.20)
plotSOM(cluster.tries$som.cluster.pm.apecm.20)
plotSOM(cluster.tries$som.cluster.pm.apecma.20)
plotSOM(cluster.tries$som.cluster.k.20)
plotSOM(cluster.tries$som.cluster.h.20)

For brevity I’m not showing the plots produced for all the different clustering algorithms. For these data the k-means and hierarchical clustering algorithms both look pretty good, though I have a slight preference for the k-means version:

The SOM and final map unit clustering represent a classification model that can be saved for use with later data. One huge advantage to using SOMs over other analysis methods (e.g., ordination techniques) is their usefulness for organizing newly collected data. New data, if necessary scaled and normalized in the same way as the training data, can be classified by finding the map unit with the minimum distance to the new observation. To demonstrate this, we’ll generate and classify a small new RGB dataset (in reality classifying in this way is very efficient, and could accommodate a huge number of new observations). First, we save the SOM and final clustering.

## The SOM and map unit clustering represent a classification model.  These can be saved for
## later use.

som.cluster <- som.cluster.k
som.notes <- c('Clustering based on k-means')

save(file = 'som_model_demo.Rdata', list = c('som.cluster', 'som.notes', 'som.model', 'som.events.colors'))

Then we generate new RGB data, classify it, and make a plot to compare the original data, the color of the winning map unit, and the color of the cluster that map unit belongs to.

#### classification ####

## make a new dataset to classify ##

new.data.size <- 20
new.data <- as.data.frame(matrix(nrow = new.data.size, ncol = 3))
colnames(new.data) <- c('R', 'G', 'B')

new.data$R <- sample(0:255, new.data.size, replace = T)
new.data$G <- sample(0:255, new.data.size, replace = T)
new.data$B <- sample(0:255, new.data.size, replace = T)

## get the closest map unit to each point

new.data.units <- map(som.model, newdata = data.matrix(new.data))

## get the classification for closest map units

new.data.classes <- som.cluster[new.data.units$unit.classif]

## compare colors of the new data, unit, and class, first define a function
## to calculate the mean colors for each cluster

clusterMeanColor <- function(clusters){
  cluster.means = c()
  som.codes <- som.model$codes[[1]]
  
  for(c in sort(unique(clusters))){
    temp.members <- which(clusters == c)
    
    if(length(temp.members) > 1){
      temp.codes <- som.codes[temp.members,]
      temp.means <- colMeans(temp.codes)
      temp.col <- rgb(temp.means[1], temp.means[2], temp.means[3], maxColorValue = 255)
      cluster.means <- append(cluster.means, temp.col)
    }else({
      temp.codes <- som.codes[temp.members,]
      temp.col <- rgb(temp.codes[1], temp.codes[2], temp.codes[3], maxColorValue = 255)
      cluster.means <- append(cluster.means, temp.col)
      })
  }
  
  return(cluster.means)
  
}

class.colors <- clusterMeanColor(som.cluster)

plot(1:length(new.data$R), rep(1, length(new.data$R)),
     col = rgb(new.data$R, new.data$G, new.data$B, maxColorValue = 255),
     ylim = c(0,4),
     pch = 19,
     cex = 3,
     xlab = 'New data',
     yaxt = 'n',
     ylab = 'Level')

axis(2, at = c(1, 2, 3), labels = c('New data', 'Unit', 'Class'))

points(1:length(new.data.units$unit.classif), rep(2, length(new.data.units$unit.classif)),
     col = som.events.colors[new.data.units$unit.classif],
     pch = 19,
     cex = 3)

points(1:length(new.data.classes), rep(3, length(new.data.classes)),
       col = class.colors[new.data.classes],
       pch = 19,
       cex = 3)

Looks pretty good! Despite defining only 20 classes, class seems to be a reasonable representation of the original data. Only slight differences in color can be observed between the data, winning map unit, and class.
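If you want to see what the classification step is doing under the hood, here is a minimal base R sketch of the idea: find the codebook vector closest to each new observation, then look up the class of that map unit. It assumes plain squared Euclidean distance and uses a made-up four-unit codebook, so it illustrates the concept rather than reproducing the `map` function from kohonen. The names `nearestUnit`, `codes`, and `unit.class` are hypothetical.

```r
## a made-up codebook of 4 map units (rows) in RGB space
codes <- rbind(c(255, 0, 0), c(0, 255, 0), c(0, 0, 255), c(250, 10, 10))

## a pretend clustering of those 4 units into 3 classes
unit.class <- c(1, 2, 3, 1)

## two new observations to classify
new.obs <- rbind(c(240, 20, 30), c(10, 10, 200))

## for each observation, index of the unit with the smallest squared distance
nearestUnit <- function(obs, codes){
  apply(obs, 1, function(x){
    which.min(rowSums(sweep(codes, 2, x) ^ 2))
  })
}

winning.unit <- nearestUnit(new.obs, codes)  # closest map unit per observation
new.class <- unit.class[winning.unit]        # class inherited from that unit
```

The reddish observation lands on a red unit and inherits its class, which is the same two-level lookup used in the tutorial.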

Tutorial: Nanopore Analysis Pipeline

Sabeel Mansuri — Fri, 07 Dec 2018 15:12:19 +0000

Introduction

Hi! I’m Sabeel Mansuri, an Undergraduate Research Assistant for the Bowman Lab at the Scripps Institution of Oceanography, University of California San Diego. The following is a tutorial that demonstrates a pipeline used to assemble and annotate a bacterial genome from Oxford Nanopore MinION data.

This tutorial will require the following (brief installation instructions are included below):

Software Installation

Canu

Canu is a packaged correction, trimming, and assembly program that is forked from the Celera assembler codebase. Install the latest release by running the following:

git clone https://github.com/marbl/canu.git
cd canu/src
make

Bandage

Bandage is an assembly visualization software. Install it by visiting this link, and downloading the version appropriate for your device.

Prokka

Prokka is a gene annotation program. Install it by visiting this link, and running the installation commands appropriate for your device.

Barrnap

Barrnap is an rRNA prediction software used by Prokka. Install it by visiting this link, and running the installation commands appropriate for your device.

DNAPlotter

DNAPlotter is a gene annotation visualization software. Install it by visiting this link, and running the installation commands appropriate for your device.

Dataset

Download the nanopore dataset located here. This is an isolate from a sample taken from a local saline lake at South Bay Salt Works near San Diego, California.

The download will provide a tarball. Extract it:

tar -xvf nanopore.tar.gz

This will create a runs_fastq folder holding 8 fastq files of sequence reads.

Assembly

Canu can be used directly on the data without any preprocessing. The only additional information needed is an estimate of the genome size of the sample. For the saline isolate, we estimate 3,000,000 base pairs. Then, use the following Canu command to assemble our data:

canu -p test_canu -d test_canu genomeSize=3000000 gnuplotTested=true -nanopore-raw runs_fastq/*.fastq

A quick description of all flags and parameters:

-nanopore-raw – specifies data is Oxford Nanopore with no data preprocessing; the read files follow this flag
-p – specifies prefix for output files, use “test_canu” as default
-d – specifies directory to run test and output files in, use “test_canu” as default
genomeSize – estimated genome size of isolate
gnuplotTested – setting to true will skip gnuplot testing; gnuplot is not needed for this pipeline

Running this command will output various files into the test_canu directory. The assembled contigs are located in the test_canu.contigs.fasta file. These contigs can be better visualized using Bandage.

Assembly Visualization

Open Bandage and a GUI window should pop up. In the toolbar, click File > Load Graph, and select the test_canu.contigs.gfa file. You should see something like the following:

This graph reveals that one of our contigs appears to be a whole circular chromosome! A quick comparison with the test_canu.contigs.fasta file reveals this is Contig 1. We extract only this sequence from the contigs file to examine further. Note that the first contig takes up the first 38,673 lines of the file, so use head:

head -n38673 test_canu/test_canu.contigs.fasta > test_canu/contig1.fasta
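Counting lines works for this assembly, but the number will differ for any other dataset. As an alternative sketch (not part of the original pipeline), a few lines of R can pull out the first FASTA record by looking for the header lines instead. The helper name `firstFastaRecord` is hypothetical, and the commented-out paths assume the output names used in this tutorial.

```r
## Hypothetical helper: return the lines belonging to the first record
## (header plus sequence lines) of a FASTA file read with readLines().
firstFastaRecord <- function(fasta.lines){
  headers <- grep('^>', fasta.lines)
  if(length(headers) == 0) stop('no FASTA header found')
  
  ## the record ends just before the second header, or at the end of the file
  last.line <- if(length(headers) > 1) headers[2] - 1 else length(fasta.lines)
  fasta.lines[headers[1]:last.line]
}

## toy example with two short records
demo <- c('>tig00000001 len=8', 'ACGT', 'ACGT', '>tig00000002 len=4', 'TTTT')
first.record <- firstFastaRecord(demo)

## with the real assembly (paths assumed from the commands above):
## contigs <- readLines('test_canu/test_canu.contigs.fasta')
## writeLines(firstFastaRecord(contigs), 'test_canu/contig1.fasta')
```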

NCBI BLAST

We blast this Contig using NCBI’s nucleotide BLAST database (linked here) with all default options. The top hit is:

Hit: Halomonas sp. hl-4 genome assembly, chromosome: I  
Organism: Halomonas sp. hl-4  
Phylogeny: Bacteria/Proteobacteria/Gammaproteobacteria/Oceanospirillales/Halomonadaceae/Halomonas  
Max score: 65370  
Query cover: 72%  
E value: 0.0  
Ident: 87%

It appears this chromosome is the genome of an organism in the genus Halomonas. We may now be interested in the gene annotation of this genome.

Gene Annotation

Prokka will take care of gene annotation; the only required input is the contig1.fasta file.

prokka --outdir circular --prefix test_prokka test_canu/contig1.fasta

The newly created circular directory contains various files with data on the gene annotation. Take a look inside test_prokka.txt for a summary of the annotation. We can take a quick look at the annotation using the DNAPlotter GUI. For a more customized circular plot use circos.

Summary

The analysis above has taken Oxford Nanopore sequenced data, assembled contigs, identified the closest matching organism, and annotated its genome.

Weighted Gene Correlation Network Analysis (WGCNA) Applied to Microbial Communities

Jesse Wilson — Mon, 22 Oct 2018 17:16:24 +0000

Weighted gene correlation network analysis (WGCNA) is a powerful network analysis tool that can be used to identify groups of highly correlated genes that co-occur across your samples. Thus genes are sorted into modules and these modules can then be correlated with other traits (that must be continuous variables).

The WGCNA method and R package were originally created to assess gene expression data in human patients, and the authors have a thorough tutorial with in-depth explanations (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/). More recently, the method has been applied to microbial communities (Duran-Pinedo et al., 2011; Aylward et al., 2015; Guidi et al., 2016; Wilson et al., 2018). The following is a walkthrough using microbial sequence abundances and environmental data from my 2018 work (https://www.ncbi.nlm.nih.gov/m/pubmed/29488352/).

Background: WGCNA finds how clusters of genes (or in our case abundances of operational taxonomic units–OTUs) correlate with traits (or in our case environmental variables or biochemical rates) using hierarchical clustering, novel applications of weighted adjacency functions and topological overlap measures, and a dynamic tree cutting method.

Very simply, each OTU is going to be represented by a node in a vast network and the adjacency (a score between 0 and 1) between each set of nodes will be calculated. Many networks use hard-thresholding (where a connection score [e.g. a Pearson Correlation Coefficient] between any two nodes is noted as 1 if it is above a certain threshold and noted as 0 if it is below it). This ignores the actual strength of the connection so WGCNA constructs a weighted gene (or OTU) co-occurrence adjacency matrix in lieu of ‘hard’ thresholding. Because our original matrix has abundance data the abundance of each OTU is also factored in.

For this method to work you also have to select a soft thresholding power (sft) to which each co-expression similarity is raised in order to make these scores “connection strengths”. I used a signed adjacency function:

Adjacency = (0.5*(1+Pearson correlation))^sft

because it preserves the sign of the connection (whether nodes are positively or negatively correlated) and this is recommended by the authors of WGCNA.
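To see the difference between hard and soft thresholding on actual numbers, here is a small base R sketch using a made-up correlation matrix for three OTUs. The 0.85 hard cutoff and the power of 10 are arbitrary choices for illustration, and the soft version applies the power to the whole shifted correlation, as in the WGCNA definition of a signed network.

```r
## toy Pearson correlation matrix for three OTUs (values are made up)
cor.mat <- matrix(c( 1.0,  0.9, -0.8,
                     0.9,  1.0,  0.1,
                    -0.8,  0.1,  1.0), nrow = 3, byrow = TRUE)

sft.power <- 10  # soft thresholding power, chosen arbitrarily here

## hard thresholding: a connection is either 1 or 0
hard.adj <- (abs(cor.mat) >= 0.85) * 1

## signed soft thresholding: shift correlations into [0, 1], then raise to the power
signed.adj <- (0.5 * (1 + cor.mat)) ^ sft.power

round(signed.adj, 4)
```

Notice that the strong negative correlation (-0.8) ends up with an adjacency near zero in the signed network, while the strong positive correlation (0.9) keeps a substantial weight rather than being flattened to 1.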

You pick your soft thresholding value by using the scale-free topology criterion. This is based on the idea that the probability that a node is connected with k other nodes decays as a power law:

p(k)~ k^(-γ)

This idea is linked to the growth of networks–new nodes are more likely to be attached to already established nodes. In general, scale-free networks display a high degree of tolerance against errors (Zhang & Horvath, 2005).
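The scale-free criterion can be sketched in base R: for a given power, compute each node's connectivity k, bin the nodes by k, and regress log10 of the fraction of nodes per bin on log10 of the mean k per bin. The r2 of that fit is the quantity you look at when choosing a power. This is a simplified stand-in for what the WGCNA function reports, not a reimplementation; the simulated data, the helper name `scaleFreeR2`, and the choice of 10 bins are all assumptions for illustration.

```r
set.seed(42)
n.otu <- 200
n.samp <- 30

## simulated abundances: a shared signal of varying strength plus noise
signal <- rnorm(n.samp)
loading <- runif(n.otu, 0, 1)
sim <- sapply(loading, function(l) l * signal + rnorm(n.samp))

## Hypothetical helper: scale-free fit for one soft thresholding power
scaleFreeR2 <- function(dat, power, n.bins = 10){
  adj <- (0.5 * (1 + cor(dat))) ^ power       # signed adjacency
  k <- colSums(adj) - 1                       # connectivity, excluding self
  bins <- cut(k, n.bins)
  mean.k <- tapply(k, bins, mean)             # mean connectivity per bin
  p.k <- as.vector(table(bins)) / length(k)   # fraction of nodes per bin
  keep <- p.k > 0                             # empty bins cannot be logged
  fit <- lm(log10(p.k[keep]) ~ log10(mean.k[keep]))
  c(power = power, r2 = summary(fit)$r.squared, slope = unname(coef(fit)[2]))
}

fits <- t(sapply(c(1, 4, 8, 12), function(p) scaleFreeR2(sim, p)))
fits
```

A negative slope with a high r2 is what a power-law decay looks like on these axes.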

You then turn your adjacency matrix into a Topological Overlap Measure (TOM) to minimize the effects of noise and spurious associations. A topological overlap of two nodes also factors in all of their shared neighbors (their relative interrelatedness)–so you are basically taking a simple co-occurrence between two nodes and placing it in the framework of the entire network by factoring in all the other nodes each is connected to. For more information regarding adjacency matrices and TOMs please see Zhang & Horvath (2005) and Langfelder & Horvath (2007 & 2008).
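Here is a minimal base R sketch of the topological overlap idea, using the unsigned form from Zhang & Horvath (2005): the overlap between nodes i and j is (shared-neighbor weight + direct adjacency) divided by (the smaller of the two connectivities + 1 - direct adjacency). The four-node adjacency matrix is made up and the helper name `topoOverlap` is hypothetical.

```r
## toy weighted adjacency matrix for 4 nodes (values are made up)
adj <- matrix(c(0.0, 0.8, 0.6, 0.0,
                0.8, 0.0, 0.7, 0.1,
                0.6, 0.7, 0.0, 0.0,
                0.0, 0.1, 0.0, 0.0), nrow = 4, byrow = TRUE)

topoOverlap <- function(a){
  diag(a) <- 0                    # ignore self-connections
  shared <- a %*% a               # l_ij = sum over u of a_iu * a_uj
  k <- rowSums(a)                 # connectivity of each node
  min.k <- outer(k, k, pmin)      # smaller connectivity of each pair
  tom <- (shared + a) / (min.k + 1 - a)
  diag(tom) <- 1
  tom
}

tom <- topoOverlap(adj)
round(tom, 3)
```

Nodes 1 and 4 have no direct connection, yet their overlap is greater than zero because they share node 2 as a neighbor. That is the "third party" information the TOM adds.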

Start: Obtain an OTU abundance matrix (MB.0.03.subsample.fn.txt) and environmental data (OxygenMatrixMonterey.csv).

The OTU abundance matrix simply has all the different OTUs that were observed in a bunch of different samples (denoted in the Group column; e.g. M1401, M1402, etc.). These OTUs represent 16S rRNA sequences that were assessed with the universal primers 515F-Y (5′-GTGYCAGCMGCCGCGGTAA) and 926R (5′-CCGYCAATTYMTTTRAGTTT) and were created using a 97% similarity cutoff. These populations were previously subsampled to the smallest library size and all processing took place in mothur (https://www.mothur.org/). See Wilson et al. (2018) for more details.

The environmental data matrix tells you a little bit more about the different Samples, like the Date of collection, which of two site Locations it was collected from, and the Depth or Zone of collection. You also see a bunch of different environmental variables like several different Upwelling indices (for different stations and different time spans), community respiration rate (CR), Oxygen Concentration, and Temperature. Again, see Wilson et al. (2018) for more details.

Code–Initial Stuff:

Read data in:

data<-read.table("MB.0.03.subsample.fn.txt",header=T,na.strings="NA")

For this particular file we have to get rid of first three columns since the OTUs don’t actually start until the 4th column:

data1 = data[, -(1:3)]

You should turn your raw abundance values into a relative abundance matrix and potentially transform it. I recommend a Hellinger Transformation (a square root of the relative abundance)–this effectively gives low weight to variables with low counts and many zeros. If you wanted you could do the Logarithmic transformation of Anderson et al. (2006) here instead.

library("vegan")
HellingerData<-decostand(data1,method = "hellinger")
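If you are curious what the Hellinger transformation does to the numbers, it is easy to reproduce by hand on a toy count table: divide each count by its sample (row) total, then take the square root. The sample and OTU names below are invented for illustration.

```r
## toy count table: samples in rows, OTUs in columns (values are made up)
counts <- matrix(c(10, 0, 30,
                    5, 5, 10), nrow = 2, byrow = TRUE,
                 dimnames = list(c('M1401', 'M1402'), c('Otu1', 'Otu2', 'Otu3')))

rel.abund <- counts / rowSums(counts)  # relative abundance within each sample
hellinger <- sqrt(rel.abund)           # Hellinger transformation

## the squared values in each row sum to 1
rowSums(hellinger ^ 2)
```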

You have to limit the OTUs to the most frequent ones (ones that occur in multiple samples so that you can measure co-occurrence across samples). I just looked at my data file and looked for where zeros became extremely common. This was easy because mothur sorts OTUs according to abundance. If you would like a more objective way of selecting the OTUs or if your OTUs are not sorted then this code may help:

lessdata <- data1[,colSums(data1) > 0.05]

(though you will have to decide what cutoff works best for your data).
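Another option, sketched here as an assumption rather than what I did for this dataset, is to filter on prevalence: keep only the OTUs that are present in at least some minimum number of samples. The toy table and the `min.samples` cutoff are made up for illustration.

```r
## toy count table: 4 samples (rows) by 3 OTUs (columns), values are made up
counts <- data.frame(Otu1 = c(10, 4, 7, 1),
                     Otu2 = c(0, 0, 3, 0),
                     Otu3 = c(2, 0, 5, 9))

min.samples <- 3  # hypothetical cutoff; pick what suits your data

## number of samples in which each OTU was observed
prevalence <- colSums(counts > 0)

## keep OTUs seen in at least min.samples samples
common <- counts[, prevalence >= min.samples, drop = FALSE]
names(common)
```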

Code–Making your relative abundance matrix:

You have to reattach the Group Name column:

RelAbun1 = data.frame(data[2],HellingerData[1:750])

Write file (this step isn’t absolutely necessary, but you may want this file later at some point):

write.table(RelAbun1, file = "MontereyRelAbun.txt", sep="\t")

Code–Visualizing your data at the sample level:

Now load the WGCNA package:

library("WGCNA")

Bring data in:

OTUs<-read.table("MontereyRelAbun.txt",header=T,sep="\t")
dim(OTUs);
names(OTUs);

Turn the first column (sample name) into row names (so that only OTUs make up actual columns):

datExpr0 = as.data.frame((OTUs[,-c(1)]));
names(datExpr0) = names(OTUs)[-c(1)];
rownames(datExpr0) = OTUs$Group;

Check Data for excessive missingness:

gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK

You should get TRUE for this dataset given the parameters above. TRUE means that all OTUs have passed the cut. This means that when you limited your OTUs to the most common ones above that you didn’t leave any in that had too many zeros. It is still possible that you were too choosy though. If you got FALSE for your data then you have to follow some other steps that I don’t go over here.

Cluster the samples to see if there are any obvious outliers:

sampleTree = hclust(dist(datExpr0), method = "average");

sizeGrWindow(12,9)

par(cex = 0.6);
par(mar = c(0,4,2,0))

plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)

The sample dendrogram doesn’t show any obvious outliers so I didn’t remove any samples. If you need to remove some samples then you have to follow some code I don’t go over here.

Now read in trait (Environmental) data and match with expression samples:

traitData = read.csv("OxygenMatrixMonterey.csv");
dim(traitData)
names(traitData)

Form a data frame analogous to expression data (relative abundances of OTUs) that will hold the Environmental traits:

OTUSamples = rownames(datExpr0);
traitRows = match(OTUSamples, traitData$Sample);
datTraits = traitData[traitRows, -1];
rownames(datTraits) = traitData[traitRows, 1];
collectGarbage()

Outcome: Now your OTU expression (or abundance) data are stored in the variable datExpr0 and the corresponding environmental traits are in the variable datTraits. Now you can visualize how the environmental traits relate to clustered samples.

Re-cluster samples:

sampleTree2 = hclust(dist(datExpr0), method = "average")

Convert traits to a color representation: white means low, red means high, grey means missing entry:

traitColors = numbers2colors(datTraits[5:13], signed = FALSE);

Plot the sample dendrogram and the colors underneath:

plotDendroAndColors(sampleTree2, traitColors,
groupLabels = names(datTraits[5:13]),
main = "Sample dendrogram and trait heatmap")

Again: white means a low value, red means a high value, and gray means missing entry. This is just initial stuff… we haven’t looked at modules of OTUs that occur across samples yet.

Save:

save(datExpr0, datTraits, file = "Monterey-dataInput.RData")

Code–Network Analysis:

Allow multi-threading within WGCNA. This helps speed up certain calculations.
Any error here may be ignored but you may want to update WGCNA if you see one.

options(stringsAsFactors = FALSE);
enableWGCNAThreads()

Load the data saved in the first part:

lnames = load(file = "Monterey-dataInput.RData");

The variable lnames contains the names of loaded variables:

lnames

Note: You have a couple of options for how you create your weighted OTU co-expression network. I went with the step-by-step construction and module detection. Please see this document for information on the other methods (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/Simulated-05-NetworkConstruction.pdf).

Choose a set of soft-thresholding powers:

powers = c(c(1:10), seq(from = 11, to=30, by=1))

Call the network topology analysis function:
Note: I am using a signed network because it preserves the sign of the connection (whether nodes are positively or negatively correlated); this is recommended by the authors of WGCNA.

sft = pickSoftThreshold(datExpr0, powerVector = powers, verbose = 5, networkType = "signed")

Output:

pickSoftThreshold: will use block size 750.
 pickSoftThreshold: calculating connectivity for given powers...
   ..working on genes 1 through 750 of 750
   Power SFT.R.sq slope truncated.R.sq  mean.k. median.k. max.k.
1      1   0.0299  1.47          0.852 399.0000  400.0000 464.00
2      2   0.1300 -1.74          0.915 221.0000  221.0000 305.00
3      3   0.3480 -2.34          0.931 128.0000  125.0000 210.00
4      4   0.4640 -2.41          0.949  76.3000   73.1000 150.00
5      5   0.5990 -2.57          0.966  47.2000   44.0000 111.00
6      6   0.7010 -2.52          0.976  30.1000   27.1000  83.40
7      7   0.7660 -2.47          0.992  19.8000   17.2000  64.30
8      8   0.8130 -2.42          0.986  13.3000   11.0000  50.30
9      9   0.8390 -2.34          0.991   9.2200    7.1900  40.00
10    10   0.8610 -2.24          0.992   6.5200    4.8800  32.20
11    11   0.8670 -2.19          0.987   4.7000    3.3700  26.20
12    12   0.8550 -2.18          0.959   3.4600    2.3300  21.50

This is showing you the power (soft thresholding value), the r2 for the scale independence for each particular power (we shoot for an r2 higher than 0.8), the mean number of connections each node has at each power (mean.k), the median number of connections/node (median.k), and the maximum number of connections (max.k).

Plot the results:

sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;

Scale-free topology fit index (r2) as a function of the soft-thresholding power:

plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");

This line corresponds to using an R^2 cut-off of h:

abline(h=0.8,col="red")

Mean connectivity as a function of the soft-thresholding power:

plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

I picked a soft thresholding value of 10 because it was well above an r2 of 0.8 (it is a local peak for the r2) and the mean connectivity is still above 0.

So now we just calculate the adjacencies, using the soft thresholding power of 10:

softPower = 10;
adjacency = adjacency(datExpr0, power = softPower, type = "signed");

Then we transform the adjacency matrix into a Topological Overlap Matrix (TOM) and calculate corresponding dissimilarity:

Remember: The TOM you calculate shows the topological similarity of nodes, factoring in the connection strength two nodes share with other “third party” nodes. This will minimize effects of noise and spurious associations:

TOM = TOMsimilarity(adjacency, TOMType = "signed");
dissTOM = 1-TOM

Build a hierarchical clustering tree (dendrogram) of the OTUs from the TOM dissimilarity by calling the hierarchical clustering function:

TaxaTree = hclust(as.dist(dissTOM), method = "average");

Plot the resulting clustering tree (dendrogram):

sizeGrWindow(12,9)
plot(TaxaTree, xlab="", sub="", main = "Taxa clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04);

This image is showing us the clustering of all 750 OTUs based on the TOM dissimilarity index.

Now you have to decide the optimal module size for your system and should play around with this value a little. I wanted relatively large modules so I set the minimum module size relatively high at 30:

minModuleSize = 30;

Module identification using dynamic tree cut (there are a couple of different ways to figure out your modules and so you should explore what works best for you in the tutorials by the authors):

dynamicMods = cutreeDynamic(dendro = TaxaTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize);
table(dynamicMods)

Convert numeric labels into colors:

dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)

dynamicColors
    black      blue     brown     green       red turquoise    yellow 
       49       135       113        71        64       216       102

You can see that there are a total of 7 modules (you should have seen that above too) and that now each module is named a different color. The numbers under the colors tell you how many OTUs were sorted into that module. Each OTU is in exactly 1 module, and you can see that if you add up all of the numbers from the various modules you get 750 (the number of OTUs that we limited our analysis to above).

Plot the dendrogram with module colors underneath:

sizeGrWindow(8,6)
plotDendroAndColors(TaxaTree, dynamicColors, "Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Taxa dendrogram and module colors")

Now we will quantify co-expression similarity of the entire modules using eigengenes and cluster them based on their correlation:
Note: An eigengene is the 1st principal component of a module expression matrix and represents a suitably defined average OTU community.

Calculate eigengenes:

MEList = moduleEigengenes(datExpr0, colors = dynamicColors)
MEs = MEList$eigengenes
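To make the eigengene idea less abstract, here is a base R sketch that computes the first principal component of a simulated module with `svd`. It is a conceptual illustration only; the simulated data and variable names are made up, and I am assuming (as I understand the WGCNA approach) that columns are scaled first and the component is oriented to agree with the module average.

```r
set.seed(7)

## simulate one module: 5 OTUs across 20 samples sharing a common signal
shared <- rnorm(20)
module <- sapply(1:5, function(i) shared + rnorm(20, sd = 0.3))

## scale each OTU, then take the first left singular vector (one value per sample)
module.scaled <- scale(module)
sv <- svd(module.scaled)
eigengene <- sv$u[, 1]

## fraction of the module's variance captured by the eigengene
var.explained <- sv$d[1] ^ 2 / sum(sv$d ^ 2)

## the sign of a principal component is arbitrary, so orient it to match
## the average scaled abundance of the module
if(cor(eigengene, rowMeans(module.scaled)) < 0) eigengene <- -eigengene
```

Because the simulated OTUs track the same signal, the eigengene follows that signal closely and explains most of the variance.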

Calculate dissimilarity of module eigengenes:

MEDiss = 1-cor(MEs);

Cluster module eigengenes:

METree = hclust(as.dist(MEDiss), method = "average");

Plot the result:

sizeGrWindow(7, 6)
plot(METree, main = "Clustering of module eigengenes",
xlab = "", sub = "")

Now we will see if any of the modules should be merged. I chose a height cut of 0.30, corresponding to a similarity of 0.70 to merge:

MEDissThres = 0.30

Plot the cut line into the dendrogram:

abline(h=MEDissThres, col = "red")

You can see that, according to our cutoff, none of the modules should be merged.

If there were some modules that needed to be merged you can call an automatic merging function:

merge = mergeCloseModules(datExpr0, dynamicColors, cutHeight = MEDissThres, verbose = 3)

The merged module colors:

mergedColors = merge$colors;

Eigengenes of the new merged modules:

mergedMEs = merge$newMEs;

If you had combined different modules then that would show in this plot:

sizeGrWindow(12, 9)

plotDendroAndColors(TaxaTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)

If we had merged some of the modules that would show up in the Merged dynamic color scheme.

Rename the mergedColors to moduleColors:

moduleColors = mergedColors

Construct numerical labels corresponding to the colors:

colorOrder = c("grey", standardColors(50));
moduleLabels = match(moduleColors, colorOrder)-1;
MEs = mergedMEs;

Save module colors and labels for use in subsequent parts:

save(MEs, moduleLabels, moduleColors, TaxaTree, file = "Monterey-networkConstruction-stepByStep.RData")

Code–Relating modules to external information and IDing important taxa:

Here you are going to identify modules that are significantly associated with environmental traits/biogeochemical rates. You already have summary profiles for each module (eigengenes–remember that an eigengene is the 1st principal component of a module expression matrix and represents a suitably defined average OTU community), so we just have to correlate these eigengenes with environmental traits and look for significant associations.

Defining numbers of OTUs and samples:

nTaxa = ncol(datExpr0);
nSamples = nrow(datExpr0);

Recalculate MEs (module eigengenes):

MEs0 = moduleEigengenes(datExpr0, moduleColors)$eigengenes
MEs = orderMEs(MEs0)
moduleTraitCor = cor(MEs, datTraits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);

Now we will visualize it:

sizeGrWindow(10,6)

textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));

labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = greenWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))

Each row corresponds to a module eigengene and each column corresponds to an environmental trait or biogeochemical rate (as long as it is continuous–notice that the categorical variables are gray and say NA). Each cell contains the corresponding Pearson correlation coefficient (top number) and a p-value (in parentheses). The table is color-coded by correlation according to the color legend.
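
For reference, the p-values in parentheses are Student p-values for the Pearson correlations. A minimal sketch of the conventional calculation, assuming corPvalueStudent follows the standard t-test for a correlation coefficient (here r is one entry of moduleTraitCor and n is nSamples; the symbols t and T are notation introduced only for this note):

```latex
t = r\,\sqrt{\frac{n-2}{1-r^{2}}}, \qquad
p = 2\,\Pr\left(T_{n-2} > |t|\right)
```

Here T_{n-2} is a Student t variable with n-2 degrees of freedom. Because n enters directly, the same correlation becomes more significant as the number of samples grows.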

You can see that the Brown module is positively correlated with many indices of upwelling while the Black module is negatively correlated with many indices of upwelling. For this work I was particularly interested in CR and so I focused on modules that were positively or negatively correlated with CR. The Red module was negatively associated with CR while the Blue module was positively associated with CR.

Let’s look more at the Red module by quantifying the associations of individual taxa with CR:

First define the variable we are interested in from datTrait:

CR = as.data.frame(datTraits$CR);
names(CR) = "CR"

modNames = substring(names(MEs), 3)
TaxaModuleMembership = as.data.frame(cor(datExpr0, MEs, use = "p"));
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(TaxaModuleMembership), nSamples));
names(TaxaModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");
TaxaTraitSignificance = as.data.frame(cor(datExpr0, CR, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(TaxaTraitSignificance), nSamples));
names(TaxaTraitSignificance) = paste("GS.", names(CR), sep="");
names(GSPvalue) = paste("p.GS.", names(CR), sep="");

module = "red"
column = match(module, modNames);
moduleTaxa = moduleColors==module;
sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(TaxaModuleMembership[moduleTaxa, column]),
abs(TaxaTraitSignificance[moduleTaxa, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Taxa significance for CR",
main = paste("Module membership vs. Taxa significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

This graph shows, for each taxon (each red dot is an OTU that belongs to the Red module), 1) how strongly it correlates with the environmental trait of interest (y-axis) and 2) how important it is to the module (x-axis). The taxa/OTUs that have high module membership tend to occur whenever the module is represented in the environment and are therefore often connected throughout the samples with other red taxa/OTUs. In this module, these hubs (Red OTUs that occur with other Red OTUs) are also the most important OTUs for predicting CR.

Now let’s get more info about the taxa that make up the Red module:

First, merge the statistical info from the previous section (modules with high association with the trait of interest, e.g. CR or Temp) with the taxa annotation and write a file that summarizes these results:

names(datExpr0)
names(datExpr0)[moduleColors=="red"]

You will have to feed in an annotation file, i.e. a file listing what Bacteria/Archaea go with each OTU (I am not providing you with this file, but it just had a column with OTUs and a column with the Taxonomy).

annot = read.table("MB.subsample.fn.0.03.cons.taxonomy",header=T,sep="\t");
dim(annot)
names(annot)
probes = names(datExpr0)
probes2annot = match(probes, annot$OTU)

Check for the number of probes without annotation (it should return a 0):

sum(is.na(probes2annot))

Create the starting data frame:

TaxaInfo0 = data.frame(Taxon = probes,
TaxaSymbol = annot$OTU[probes2annot],
LinkID = annot$Taxonomy[probes2annot],
moduleColor = moduleColors,
TaxaTraitSignificance,
GSPvalue)

Order modules by their significance for CR:

modOrder = order(-abs(cor(MEs, CR, use = "p")));

Add module membership information in the chosen order:

for (mod in 1:ncol(TaxaModuleMembership))
{
oldNames = names(TaxaInfo0)
TaxaInfo0 = data.frame(TaxaInfo0, TaxaModuleMembership[, modOrder[mod]],
MMPvalue[, modOrder[mod]]);
names(TaxaInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
paste("p.MM.", modNames[modOrder[mod]], sep=""))
}

Order the OTUs in the TaxaInfo0 variable first by module color, then by TaxaTraitSignificance:

TaxaOrder = order(TaxaInfo0$moduleColor, -abs(TaxaInfo0$GS.CR));
TaxaInfo = TaxaInfo0[TaxaOrder, ]

Write file:

write.csv(TaxaInfo, file = "TaxaInfo.csv")

Here is a bit of the output file I got:

	Taxon	TaxaSymbol	LinkID	moduleColor	GS.TotalRate	p.GS.TotalRate	MM.red	p.MM.red	MM.blue	p.MM.blue	MM.green	p.MM.green	MM.brown	p.MM.brown	MM.turquoise	p.MM.turquoise	MM.black	p.MM.black	MM.yellow	p.MM.yellow
Otu00711	Otu00711	Otu00711	Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);SAR11_clade(100);Surface_4(100);	black	0.461005	0.00111	0.005028	0.973244	0.243888	0.098526	-0.07719	0.606075	-0.25274	0.086535	0.058878	0.694233	0.502027	0.000324	0.132412	0.374947
Otu00091	Otu00091	Otu00091	Bacteria(100);Bacteroidetes(100);Flavobacteriia(100);Flavobacteriales(100);Flavobacteriaceae(100);Formosa(100);	black	0.378126	0.008778	-0.17243	0.246462	0.446049	0.001676	0.34467	0.017667	-0.55057	6.08E-05	0.492517	0.000437	0.615168	4.20E-06	0.367211	0.011115
Otu00082	Otu00082	Otu00082	Bacteria(100);Bacteroidetes(100);Flavobacteriia(100);Flavobacteriales(100);Flavobacteriaceae(100);NS5_marine_group(100);	black	-0.35649	0.013911	0.222515	0.132755	-0.06428	0.667734	0.175654	0.237601	-0.45502	0.001312	0.421756	0.003151	0.750195	1.28E-09	0.126362	0.397349
Otu00341	Otu00341	Otu00341	Bacteria(100);Bacteroidetes(100);Cytophagia(100);Cytophagales(100);Flammeovirgaceae(100);Marinoscillum(100);	black	-0.28242	0.054435	0.023927	0.873162	-0.07555	0.613762	0.144688	0.331879	-0.03144	0.833838	0.035147	0.814565	0.209255	0.158058	-0.06565	0.661083
Otu00537	Otu00537	Otu00537	Bacteria(100);Verrucomicrobia(100);Verrucomicrobiae(100);Verrucomicrobiales(100);Verrucomicrobiaceae(100);Persicirhabdus(100);	black	-0.23668	0.109211	0.123673	0.40755	-0.17362	0.243171	-0.05738	0.701628	-0.26399	0.072961	0.264411	0.072493	0.425082	0.002897	0.040045	0.789278
Otu00262	Otu00262	Otu00262	Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);SAR11_clade(100);Surface_1(100);Candidatus_Pelagibacter(90);	black	-0.23615	0.110023	0.327396	0.02468	-0.22748	0.12411	-0.13779	0.355699	-0.23708	0.108594	0.271968	0.064409	0.554592	5.23E-05	0.036113	0.809563
Otu00293	Otu00293	Otu00293	Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);SAR11_clade(100);SAR11_clade_unclassified(100);	black	0.223427	0.131133	0.142016	0.34098	0.209327	0.157912	0.234713	0.112274	-0.53032	0.000126	0.529907	0.000128	0.627937	2.30E-06	0.390714	0.006621
Otu00524	Otu00524	Otu00524	Bacteria(100);Actinobacteria(100);Acidimicrobiia(100);Acidimicrobiales(100);OM1_clade(100);Candidatus_Actinomarina(100);	black	-0.20559	0.165629	0.28312	0.053809	0.016758	0.910982	0.148756	0.318316	-0.34758	0.016669	0.376043	0.009188	0.494903	0.000406	0.377597	0.00888
Otu00370	Otu00370	Otu00370	Bacteria(100);Verrucomicrobia(100);Opitutae(100);MB11C04_marine_group(100);MB11C04_marine_group_unclassified(100);	black	-0.20397	0.169074	0.303984	0.037771	-0.06655	0.656707	0.009401	0.949991	-0.25451	0.084275	0.097595	0.514	0.642409	1.13E-06	0.224711	0.128875

NOTES on output:

moduleColor is the module that the OTU was ultimately put into

GS stands for Gene Significance (for us it means taxon significance) while MM stands for module membership.

GS.Environmentaltrait = Pearson Correlation Coefficient for that OTU with the trait. GS allows incorporation of external info into the co-expression network by showing gene/OTU significance. The higher the absolute value of GS the more biologically significant the gene (or in our case taxa) to that external variable (e.g. CR).
p.GS.Environmentaltrait = P value for the preceding relationship.

MM.color = Pearson Correlation Coefficient for Module Membership, i.e. how well that OTU correlates with that particular color module (each OTU has a value for each module but only belongs to one module). If it is close to 0 or negative then the taxon is not part of that color module (since each OTU has to be put in a module you may get some OTUs that are close to 0, but they aren’t important to that module). If it is close to 1 then the taxon is highly connected to that color module, but it will be placed in the color module that it is most connected to throughout the samples.
p.MM.color = P value for the preceding relationship.

Modules will be ordered by their significance for the external variable you selected (e.g. CR), with the most significant ones to the left.
Each of the modules (with each OTU assigned to exactly one module) will be represented for the environmental trait you selected.
You will have to rerun this for each environmental trait you are interested in.

Tutorial: How to make a map using QGIS

Natalia Erazo — Mon, 29 Jan 2018 21:31:12 +0000

Hi! I’m Natalia Erazo, currently working on the Ecuador project aimed at examining biogeochemical processes in mangrove forest. In this tutorial, we’ll learn the basics of (free) QGIS, how to import vector data, and make a map using data obtained from our recent field trip to the Ecological Reserve Cayapas Mataje in Ecuador! We’ll also learn standard map elements and QGIS function: Print Composer to generate a map.

Objectives:

I. Install QGIS

II. Learn how to upload raster data using the Plugin OpenLayers and QuickMap services.

III. Learn how to import vector data: import latitude, longitude data and additional data. Learn how to select attributes from the data e.g., salinity values and plot them.

IV. Make a map using Print Composer in QGIS.

I. QGIS- Installation

QGIS is a very powerful and user-friendly open source geographic information system that runs on Linux, Unix, Mac, and Windows. QGIS can be downloaded here. You should follow the instructions and install gdal complete.pkg, numpy.pkg, matplotlib.pkg, and qgis.pkg.

II. Install QGIS Plug-in and Upload a base map.

1. Install QGIS Plug-in

Go to Plugins and select Manage and Install plugins. This will open the plugins dialogue box; type OpenLayers Plugin and click on Install plugin.

This plugin will give you access to Google Maps, OpenStreetMap layers and others, and it is very useful for making quick maps from Google satellite, physical, and street layers. However, the OpenLayers plugin can generate zoom errors in your maps. There is another plugin, QuickMapServices, which uses tile servers rather than the direct API to get Google layers and others. This is a very useful plugin which offers more options for base maps and fewer zoom errors. To install it, follow the same steps as you did for the OpenLayers plugin, except this time type QuickMapServices and install the plugin.

Also, if you want to experiment with QuickMapServices you can expand the plugin: go to Web -> QuickMapServices -> Settings -> More services and click on get contributed pack. This will generate more options for mapping.

2. Add the base layer Map:

I recommend playing with the various options in either OpenLayers, like the Google satellite, physical, and other map layers, or QuickMapServices.

For this map, we will use the ESRI library from QuickMapServices. Go to Web -> QuickMapServices -> Esri -> ESRI Satellite

You should see your satellite map.

You can click on the zoom in icon to adjust the zoom, as shown in the map below where I zoomed in on the Galapagos Islands. You’ll also notice that on the left side you have a Layers panel box; this box shows all the layers you add to your map. Layers can be raster data or vector data; in this case we see the layer ESRI Satellite. At the far left you’ll see a list of icons that are used to import your layers. It is important to know what kind of data you are importing to QGIS so that you use the correct function.

III. Adding our vector data.

We will now add our data file, which contains the latitude and longitude of all the sites where we collected samples, in addition to values for salinity, temperature, and turbidity. You can do this with your own data by creating a file in Excel with columns for longitude and latitude and columns for other variables, and saving it as a csv file. To input data, go to the icons on the far left and click on “Add Delimited Text Layer”. Or you can click on Layer -> Add Layer -> Add Delimited Text Layer.

You’ll browse to the file with your data. Make sure that csv is selected for File format. Additionally, make sure that X field represents the column for your longitude points and Y field for latitude. QGIS is smart enough to recognize longitude and latitude columns but double check! You can also see an overview of the data with columns for latitude, longitude, Barometer mmHg, conductivity, Salinity psu and other variables. You can leave everything else as default and click ok.

You’ll be prompted to select a coordinate reference system, and this is very important because if you do not select the right one your points will appear in the wrong location. For GPS coordinates, such as the data we are using here, you need to select WGS 84 (EPSG:4326).

Now we can see all the points where we collected data!

As we saw earlier, the data contains environmental measurements such as: salinity, turbidity, temperature and others. We can style the layer with our sampling points based on the variables of our data. In this example we will create a layer representing salinity values.

You’ll right click on the layer with our data in the Layer Panel, in this case our layer: 2017_ecuador_ysi_dat.. and select properties.

There are many styles you can choose for the layer, and the styling options are located in the Style tab of the Properties dialogue. Clicking on the drop-down button in the Style dialogue, you’ll see there are five options available: Single Symbol, Categorized, Graduated, Rule Based and Point displacement. We’ll use Graduated, which allows you to break down the data into unique classes. Here we will use the salinity values and will classify them into 3 classes: low, medium, and high salinity. There are 5 modes available in the Graduated style to do this: Equal interval, Quantile, Natural breaks, Standard deviation and Pretty breaks. You can read more about these options in the QGIS documentation.

In this tutorial, for simplicity we’ll use the Quantile option. This method chooses the classes such that the number of values in each class is the same; for example, if there are 100 values and we want 4 classes, the quantile method sets the class boundaries so that each class has 25 values.
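
To make the equal-count idea concrete, here is a minimal command-line sketch (outside QGIS) that splits nine made-up salinity values into 3 classes of 3 values each. The numbers are placeholders, not our field data, and QGIS computes its own break values, so treat this only as an illustration of the principle.

```shell
# Nine made-up salinity values (placeholders, not field data).
# Sort them, then cut the sorted list into 3 classes with equal counts.
breaks=$(printf '%s\n' 5 12 30 8 25 18 33 2 21 |
  sort -n |
  awk -v classes=3 '
    { v[NR] = $1 }                      # store the sorted values in order
    END {
      per = NR / classes                # values per class (9 / 3 = 3)
      for (c = 1; c <= classes; c++) {
        lo = v[(c - 1) * per + 1]       # smallest value in this class
        hi = v[c * per]                 # largest value in this class
        printf "class %d: %s to %s (%d values)\n", c, lo, hi, per
      }
    }')
echo "$breaks"
```

Each class ends up with the same number of points even though the ranges differ in width, which is the behavior you should expect from the Quantile mode.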

In the Style section: Select->Graduated, in Column->salinity psu, and in color ramp we’ll do colors ranging from yellow to red.

In the classes box write down 3 and select mode–>Quantile. Click on classify, and QGIS will classify your values in different ranges.

Now we have all the data points colored in the 3 different ranges: low, medium, and high salinity.

However, we have a lot of points and it is hard to visualize the data points. We can edit the points by right clicking on the marker points and select edit symbol.

Now, I am going to get rid of the black outline to make the points easy to visualize. Select the point by clicking on Simple Marker and in Outline style select the No Pen. Do the same for the remaining two points.

Nice, now we can better see variations in our points based on salinity!

IV. Print Composer: making a final map

We can start to assemble the final version of our map. QGIS has the option to create a Print composer where you can edit your map. Go to Project -> New Print composer

You will be prompted to enter a title for the composer, enter the title name and hit ok. You will be taken to the Composer window.

In the Print composer window, we want to bring the map view that we see in the QGIS canvas to the composer. Go to Layout-> Add a Map. Once the Add map button is active, hold the left mouse and drag a rectangle where you want to insert the map. You will see that the rectangle window will be rendered with the map from the main QGIS canvas.

You can see in the far right end the Items box; this shows you the map you just added. If you want to make changes, you’ll select the map and edit it under item properties. Sometimes it is useful to edit the scale until you are happy with the map.

We can also add a second map of the location of Cayapas Mataje in South America as a geographic reference. Go to the main qgis canvas and zoom out the map until you can see where in South America the reserve is located.

Now go back to Print Composer and add the map of the entire region. You’ll do the same as with the first map. Go to Layout–> Add map. Drag a rectangle where you want to insert the map. You will see that the rectangle window will be rendered with the map from the main QGIS canvas. In Items box, you can see you have Map 0 and Map 1. Select Map 1, and add a frame under Item properties, click on Frame to activate it and adjust the thickness to 0.40mm.

We can add a North arrow to the map. The print composer comes with a collection of map related images including many North arrows. Click layout–> add image.

Hold down the left mouse button and draw a rectangle in the top-right corner of the map canvas.

On the right-hand panel, click on the Item Properties tab and expand the Search directories and select the north arrow image you like the most. Once you’ve selected your image, you can always edit the arrow under SVG parameters.

Now we’ll add a scale bar. Click on Layout–> Add a Scale bar. Click on the layout where you want the scale bar to appear. Choose the Style and units that fit your requirement. In the Segments panel, you can adjust the number of segments and their size. Make sure Map 0 is selected under main properties.

I’ll add a legend to the map. Go to Layout -> Add a Legend. Hold down the left mouse button and draw a rectangle over the area where you want the legend to appear. You can make any changes, such as adding a title in the item properties, changing fonts, and renaming your legend points by clicking on them and writing the text you want.

It’s time to label our map. Click on Layout ‣ Add Label. Click on the map and draw a box where the label should be. In the Item Properties tab, expand the Label section and enter the text as shown below. You can also make additional changes to your font, size by editing the label under Appearance.

Once you have your final version, you can export it as Image, PDF or SVG. For this tutorial, let’s export it as an image. Click Composer ‣ Export as Image.

Here is our final map!

Now you can try the tutorial with your own data. Making maps is always a bit challenging but put your imagination to work!

Here is a list of links that could help with QGIS:

-QGIS blog with various tutorials and new info on functions to use: here.

-If you want more information on how QGIS handles symbol and vector data styling: here is a good tutorial.

-If you need data, a good place to start is Natural Earth: Free vector and raster basemap data used for almost any cartographic endeavor.

If you have specific questions please don’t hesitate to ask.

So you want to use your computer for science…

Jeff — Sun, 05 Nov 2017 22:04:21 +0000

It’s been a while since I was a new graduate student, and I’ve forgotten how little I knew about computers back then. I was reminded recently while teaching a couple of lab members how to use ffmpeg, an excellent command line tool for building animations from images (as described in this post). We got there, but I realized that we needed a basic computing tutorial before moving on to anything more advanced. If you’re trying to use your laptop for science, but you’re not too sure about this whole “command line thing”, this post’s for you. Be warned that this tutorial is intended as only the most cursory crash course to get you moving up the initial learning curve. A comprehensive list of commands for Bash on OSX can be found here. At the end of this tutorial you will have a basic grasp of how to:

Navigate the file system
Create and edit text files
Search text files
Create a shell script
Modify the PATH variable using startup files (.bash_profile)

Okay, so I want to get a computer to do some science. What should I get?

Whatever you want. It really doesn’t matter. Most grad students seem to get Macs these days, which I don’t love (they’re costly and epitomize form over function), but have the slight advantage of a Unix-like environment hiding behind all that OSX gibberish. I use a Windows machine (which I also don’t love, as it epitomizes dysfunction over function) with Cygwin, which gives me access to all the Linux tools that I need to carry out day-to-day operations. Windows 10 users can also make use of the Bash shell add-on for Windows, but I haven’t found any advantage to this over Cygwin. The point is that you need either a) A Mac, b) A Windows machine running Cygwin or the add-on, or c) A Linux machine. The command prompt and output given below are what you will see in OSX (faked since I’m not actually using a Mac), but are similar to what you will get with Cygwin or one of the Linux distros. The commands should work the same across all of these options.

Getting familiar with “Bash”

Bash stands for Bourne-again shell, which you can read all about in many other places on the web. Bash is a very powerful tool for manipulating your computer’s file system, executing programs, and even creating programs. It is a cornerstone of scientific computing and you should have at least some passing familiarity with it. To open the Bash terminal in OSX, go to Applications/Utilities/Terminal in Finder. A mysterious black (or white) window will open, with a white (black) cursor waiting for YOU. Type “pwd” for print working directory and hit “Enter”.

jeffscomputer:~ jeff$ pwd
/Users/jeff

Bash will respond with your location, which should be something along the lines of /Users/jeff (with your own user name in place of jeff). Next type “ls” to list the contents of the directory.

jeffscomputer:~ jeff$ ls
bin
Desktop
Documents
Downloads

We can use the command “cd” to change directories. For example, if you want to move to your Desktop directory type “cd Desktop”.

jeffscomputer:~ jeff$ cd Desktop
jeffscomputer:Desktop jeff$ pwd
/Users/jeff/Desktop

Because the directory “Desktop” was just one level down, in this case the relative path “Desktop” is equivalent to typing the full path “/Users/jeff/Desktop”. Here’s a useful tip. From any location on your system “cd ~” will get you back to your home directory.

jeffscomputer:Desktop jeff$ cd ~
jeffscomputer:~ jeff$ pwd
/Users/jeff
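
As a small sketch of relative versus absolute paths, the commands below build a throwaway directory tree and move around in it. The directory names (projects, data) are made up for illustration, and “..” (meaning “one level up”) is a standard shortcut not used elsewhere in this post.

```shell
# Work in a throwaway location so nothing in your real home directory is touched.
start=$(mktemp -d)
mkdir -p "$start/projects/data"

cd "$start/projects/data"     # absolute path: spelled out from the root "/"
cd ..                         # relative path: ".." means "one level up"
up_one=$(basename "$PWD")     # name of the directory we are in now
cd data                       # relative path: a directory inside the current one
down_one=$(basename "$PWD")

echo "$up_one $down_one"
```

Relative paths save typing when you are already near the target, while absolute paths work no matter where you are.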

Now let’s create a directory to do some work. The command for this is “mkdir temp”, for “make directory with name temp.”

jeffscomputer:~ jeff$ mkdir temp

Now move into that directory. You already know how:

jeffscomputer:~ jeff$ cd temp

Creating and editing text files

You will frequently need to create and edit basic text files without all the fancy formatting of a word processing document. The most user friendly way to do this is with the program nano, which is likely already present if you are using OSX or Cygwin. Type “nano temp.txt” and nano will open a blank text file with name temp.txt.

jeffscomputer:temp jeff$ nano temp.txt

Type a couple lines of text and, when you’re ready to exit and save the file, hit ctrl-x. Nano will prompt you about saving the output, hit yes. List the contents of the directory and notice that the file temp.txt now exists. Type “nano temp.txt” again and rather than create a new file, nano will open temp.txt for editing.

Having gone through the trouble of creating that file, let’s go ahead and delete it using the remove or “rm” command.

jeffscomputer:temp jeff$ rm temp.txt

To do some fancier things with files let’s download one that has a little more information in it. There are two programs to use to fetch files online, wget and curl. I find wget much easier to use than curl. If you’re using Cygwin or Linux you likely already have it, but for OSX you first need to install a package manager, which is a whole can of worms that I’m not going to tackle in this post. So let’s use curl to download a text file, in this case “The Rime of the Ancient Mariner” by Samuel Coleridge. The basic download command for curl is:

jeffscomputer:temp jeff$ curl -O https://www.polarmicrobes.org/extras/ancient_mariner.txt

This should create the file “ancient_mariner.txt” in your working directory (e.g., “/Users/jeff/temp”).

Viewing file content

The reason we downloaded this more complex text file (it’s a pretty long poem) is to simulate a longer data file than our “temp.txt” file. Very often in scientific computing you have text files with hundreds, thousands, or even millions of lines. Just opening such a file can be onerous, let alone finding some specific piece of information. Fortunately there are tools that can help. Type “head ancient_mariner.txt”; this returns the first 10 lines of the file.

jeffscomputer:temp jeff$ head ancient_mariner.txt
PART THE FIRST
It is an ancient mariner,
And he stoppeth one of three.
"By thy long grey beard and glittering eye,
Now wherefore stopp'st thou me?
"The Bridegroom's doors are opened wide,
And I am next of kin;
The guests are met, the feast is set:
May'st hear the merry din."
He holds him with his skinny hand,

Want to guess what “tail” does? For either command you can use a “flag” to modify behavior, such as returning more lines. Flags are always preceded by “-” or “--”, and generally come before the positional arguments of the command, in this case the file. This is general syntax for Unix commands, and does not apply only to “head” and “tail”.

jeffscomputer:temp jeff$ tail -15 ancient_mariner.txt
Both man and bird and beast.
He prayeth best, who loveth best
All things both great and small;
For the dear God who loveth us,
He made and loveth all.
The Mariner, whose eye is bright,
Whose beard with age is hoar,
Is gone: and now the Wedding-Guest
Turned from the bridegroom's door.
He went like one that hath been stunned,
And is of sense forlorn:
A sadder and a wiser man,
He rose the morrow morn.

In the examples so far the command output has printed to the screen, but what if we want to capture it in a file? For example, what if we have a huge data file (millions of rows) and we want just the top few lines to test some code or share with a collaborator? This is easily done by redirecting the output using “>”.

jeffscomputer:temp jeff$ tail -15 ancient_mariner.txt > end_of_ancient_mariner.txt
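
One thing worth knowing is that “>” overwrites the target file if it already exists, while “>>” adds to the end of it. A minimal sketch, using a small made-up practice file (sample.txt) in a throwaway directory rather than the poem:

```shell
workdir=$(mktemp -d) && cd "$workdir"          # throwaway practice directory

printf 'line %s\n' 1 2 3 4 5 6 > sample.txt    # make a six-line practice file
head -n 2 sample.txt > subset.txt              # ">" creates (or overwrites) subset.txt
tail -n 1 sample.txt >> subset.txt             # ">>" appends to the end of subset.txt
cat subset.txt
```

Here “head -n 2” is the spelled-out form of the line-count flag; subset.txt ends up with the first two lines followed by the last line of sample.txt.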

Searching file content

To find specific information in a file use the command “grep”. Without launching into a full-on explanation of regular expressions, grep (very quickly) finds lines that match some given pattern. You can either count or view the lines that match. To find the lines with the word “albatross” in ancient_mariner.txt:

jeffscomputer:temp jeff$ grep 'Albatross' ancient_mariner.txt
At length did cross and Albatross,
I shot the Albatross."
Instead of the cross, the Albatross
The Albatross fell off, and sank
The harmless Albatross."
The Albatross's blood.

Seems there’s a typo in this version of the poem (and|an)! At any rate, use the -c flag to count the lines. Protip: use the “up” key to bring back the previous line, which you can then modify.

jeffscomputer:temp jeff$ grep -c 'Albatross' ancient_mariner.txt
6

Use the -v flag to select against lines with “Albatross”, which you can combine with the -c or other flags:

jeffscomputer:temp jeff$ grep -cv 'Albatross' ancient_mariner.txt
634
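
Two more grep flags that tend to be handy are -i (ignore case) and -n (show line numbers). The sketch below uses three short practice lines written to a made-up file, poem.txt, rather than the full download, so the counts are for that toy file only.

```shell
workdir=$(mktemp -d) && cd "$workdir"          # throwaway practice directory

# Three short practice lines (not the real ancient_mariner.txt).
printf '%s\n' 'I shot the Albatross.' \
              'the albatross fell off' \
              'Water, water, every where' > poem.txt

exact=$(grep -c 'Albatross' poem.txt)          # case-sensitive count
any_case=$(grep -ci 'albatross' poem.txt)      # -i ignores upper/lower case
numbered=$(grep -n 'water' poem.txt)           # -n prefixes each match with its line number

echo "$exact $any_case"
echo "$numbered"
```

Note that grep is case-sensitive unless told otherwise, which is why the first count misses the lower-case line.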

Build a basic program

So far we’ve been executing commands manually, from the command line. Suppose we have a set of commands that we want to execute a lot, or that we need a method to document our workflow. To do this we create a shell script. Fire up nano for a file named “temp.sh” and type:

#!/bin/bash

echo "hello world!"
# hey, this line doesn't do anything!
f=ancient_mariner.txt
echo $f
grep -cv 'Albatross' $f

Line by line here’s what’s going on:

The first line is called the shebang and it tells your system what interpreter to use to run the script (in this case Bash). /bin/bash is an actual location on your computer where the Bash program resides.
The next line is just a bit of formality; by strictly adhered-to convention your first program should always be a little script that says “hello world!”. It does, however, illustrate the “echo” command, which prints out information.
The next line starts with “#” which denotes a comment. Lines that start with “#” are not read by Bash. You can use that character to make notes, or to toggle commands on and off.
In the next line we assign a variable (f) a value (ancient_mariner.txt). We can now call that variable using “$”. The next two lines are examples of this.

To execute the script we simply type the file name into bash. Before we do that, however, we need to set the file permissions, as files are not by default executable. To do that we use the “chmod” command with the “a+x” options (note that this is not a flag).

jeffscomputer:temp jeff$ chmod a+x temp.sh

You can run it now, but there is one final trick. Bash doesn’t know to look in your working directory for the script; you have to specify that that’s where it is. The location of the current working directory is always “./”, so the command looks like this:

jeffscomputer:temp jeff$ ./temp.sh
hello world!
ancient_mariner.txt
634
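
A natural next step, sketched here under a few assumptions, is to let the script take the file name as an argument instead of hard-coding it. Inside a script, “$1” holds the first argument typed after the script name. The script name (count_lines.sh) and the three-line practice file (poem.txt) are made up for this example.

```shell
workdir=$(mktemp -d) && cd "$workdir"          # throwaway practice directory
printf '%s\n' 'The Albatross' 'a good south wind' 'sprung up behind' > poem.txt

# Write a hypothetical script that reads the file name from its first argument.
cat > count_lines.sh <<'EOF'
#!/bin/bash
f="$1"                       # first argument given on the command line
echo "$f"
grep -cv 'Albatross' "$f"    # count lines that do NOT contain "Albatross"
EOF

chmod a+x count_lines.sh                       # make it executable, as before
result=$(./count_lines.sh poem.txt)            # run it on the practice file
echo "$result"
```

The same script can now be pointed at any text file without editing it, which is usually what you want once a workflow is reused.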

Setting up your environment

Okay, we’re going to ramp things up a bit for the grand finale and modify the Bash startup files to better use your new-found skills. There are several possible startup files, and the whole startup file situation gets a bit confusing. We will modify the .bash_profile file, which will handle the majority of user cases, but you should take the time to familiarize yourself with the different files. The .bash_profile file is a hidden file (denoted by the .); you can see hidden files by using “ls” with the “-a” flag.

jeffscomputer:temp jeff$ cd ~
jeffscomputer:~ jeff$ ls -a
bin
Desktop
Documents
Downloads
.bash_history
.bashrc
.profile
.ssh

Don’t worry if you don’t see .bash_profile, we will create it shortly. First, to understand why we need to modify the startup file in the first place, from your home directory try executing the shell script that we just created.

jeffscomputer:~ jeff$ temp.sh
-bash: temp.sh: command not found

The command would execute if you typed temp/temp.sh, but remembering the location of every script and program so that you can specify the complete path to it would be silly. Instead, Bash stores this information in a variable named PATH. To see what’s in PATH use “echo”:

jeffscomputer:~ jeff$ echo $PATH
/usr/local/bin:/usr/bin:/usr/sbin

If you created a script in /usr/local/bin, Bash would know to look there and would find the script. It’s a good idea, however, to keep user-generated scripts in the home directory to avoid cluttering up locations used by the operating system. What we need to do is automatically update PATH with customized locations whenever we start a new Bash session. We accomplish this by modifying PATH on startup using the startup file. To do this use nano. Because .bash_profile lives in your own home directory you do not need super user (“sudo”) privileges to edit it, and editing it with sudo can leave the file owned by root.

jeffscomputer:~ jeff$ nano .bash_profile

As you already know, if you don’t have .bash_profile this will create it for you. Now, at the end of the .bash_profile file (or on the first line, if it’s empty), type (replacing “jeff” with your home directory):

export PATH=/Users/jeff/temp:$PATH

The command structure might look opaque at first but it’s really not. This line is saying “export the variable PATH as this text, followed by the original PATH variable”.
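
To see the mechanics without touching your startup file, here is a minimal sketch that prepends a throwaway directory to PATH for the current session only. The directory and the script name (hello_path.sh) are made up for illustration.

```shell
workdir=$(mktemp -d)                           # throwaway location
mkdir "$workdir/scripts"
printf '#!/bin/bash\necho "found me"\n' > "$workdir/scripts/hello_path.sh"
chmod a+x "$workdir/scripts/hello_path.sh"

old_path=$PATH                                 # remember the original value
export PATH="$workdir/scripts:$PATH"           # new directory first, then ":" and the old PATH

first_entry=$(echo "$PATH" | cut -d: -f1)      # entries are separated by ":"
located=$(command -v hello_path.sh)            # where Bash finds the script
output=$(hello_path.sh)                        # runs without "./" or a full path

echo "$first_entry"
echo "$located"
echo "$output"

export PATH=$old_path                          # put things back the way they were
```

Bash searches the PATH entries from left to right, so putting your own directory first means your scripts win if two programs share a name.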

Close nano. Recall that .bash_profile is read by Bash at the start of the Bash session. Your newly defined PATH variable will be read if you start a new session, but you can also force Bash to read it with the “source” command.

jeffscomputer:~ jeff$ source .bash_profile

Now try to execute your bash script.

jeffscomputer:~ jeff$ temp.sh
hello world!
ancient_mariner.txt
634

Last, let’s clean things up a little. Previously we used “rm” to remove a file. The same command with the -r flag will remove a directory.

jeffscomputer:~ jeff$ rm -r temp

Voila! To keep your .bash_profile file looking pretty, don’t forget to remove the line adding temp to PATH (though it does no harm). Or you can comment that line out and leave it as an example of the correct syntax for when you next add a new location to PATH.