A new model for scientific careers

Traditionally there were three ways that, having acquired a PhD, a researcher could continue with their scientific career:

1)  Seek a faculty job at an academic institution (typically a university).  University jobs are wonderful in that they allow for complete creativity.  No one but the faculty member sets his or her research agenda.  The downside is relatively low pay and very little chance of a reasonable work-life balance.  The general disinvestment in science and education, particularly at the state level, means that faculty members have to raise an increasingly large portion of their lab operating expenses (including salary) from an increasingly small pool of federal funds.  Most new faculty members find themselves in their mid-30s, holding advanced degrees, and working very long hours for something south of $100,000 a year (except at the most prestigious and well-endowed institutions).  Such is the price of freedom.

2)  Find a government position.  The government does, with some reluctance, support a fair amount of mission-driven basic science at various agencies.  Of particular note are NASA (and its quasi-government affiliates), the DOE, and NOAA.  Permanent government research positions generally offer better pay than academia and a better work-life balance.  Although the science pursued by government agencies is mission directed, there is a certain amount of flexibility.  Government scientists, after all, determine how a research or monitoring mission is implemented at the agency level.  Despite these benefits there are some real drawbacks to conducting research at a federal agency.  The bureaucracy is inflexible (and efforts to rein in spending – generally not an issue at research-minded agencies – only create more of it) and subject to the drift of political whims.  Many federal researchers are hampered by travel restrictions, work stoppages (which disallow even unpaid work from a researcher’s home), and budget woes.  Your salary might be guaranteed, but what’s the point if your employer prevents you from doing interesting work and doing it well?

3)  Get a job in industry.  In academic circles this is viewed as going over to the dark side.  The pay is good, the work-life balance is great, but the academic freedom is nonexistent.  Gone are the days of Bell Labs, when industry supported basic research as a means of cultivating innovation (that job now falls to the federal government, through NSF and NIH).

So what’s a young researcher to do?  There is a fourth option available that, until recently, was pursued by only a few researchers with either independent wealth or the discipline required to be their own administrator: the cottage research industry.  Particularly suited to researchers with low infrastructure requirements (this won’t work if you need a Level 4 biosafety lab), in the cottage-industry model you incorporate your own non-profit institution and set up your research environment in the basement (or the local coffee shop).  So long as you can leverage a little federal funding (i.e., continue to write competitive research proposals) you can operate with a fantastic degree of flexibility.  There are downsides of course, foremost among them your long-term financial security and your ability to stay connected with your peers and on the cutting edge.

What is changing now is the emergence of virtual institutions that can take on the administrative role for cottage-industry researchers and help them maintain the kind of collaborations that really make cutting-edge research possible.  One of these institutions is the Blue Marble Space Institute, brainchild of Sanjoy Som, a graduate of the Astrobiology Program (via the Atmospheric Sciences PhD track) here at the University of Washington.  There is a vetting process to join Blue Marble, but membership is non-competitive; anyone with legitimate research credentials can join.  Members of the institute can submit proposals to private and public funding sources, benefit from internal proposal peer review, and enjoy the kind of collaboration possible at a traditional academic institution.  Overhead (the amount that an institution adds to a proposal budget to cover “administrative costs”) is a fraction of what it is at a university, helping Blue Marble proposals look more competitive.

Blue Marble and similar virtual institutions have the potential to dramatically alter the academic career landscape.  I anticipate submitting an application to Blue Marble in the near future; membership will enable me to submit proposals in parallel with potential postdoctoral advisors (graduate students and even postdocs are typically not allowed to submit proposals to federal programs).  If I’m unable to obtain a postdoctoral appointment it also lets me keep my foot in the door, refining and submitting proposals until there’s a break.  I think the real advantage of Blue Marble, however, is its ability to empower researchers who might otherwise opt out of the traditional track.  I’m thinking specifically of researchers (male or female) who want to raise a family and have an academic career.  Graduate students often romantically partner with other graduate students, a pairing which almost guarantees future family/career conflicts between two career-minded individuals.  Blue Marble is a new tool in the toolkit for building an academic family, providing a mechanism for taking some time away from a brick-and-mortar institution while staying totally in the game.

And while I’m on the subject, Sanjoy has a second goal: the adoption of a symbol of global unity for the International Space Station.  The ISS is, after all, the most visible example of the possibilities of international space collaboration for the public good.  You can learn more about that project, and vote to have the patch adopted through CASIS, here.

 


Risk in Research

The polar research community was reminded of the risks in polar research yesterday with the news that a helicopter crash had killed two Canadian Coast Guard officers, including the commander of the venerated research icebreaker CCGS Amundsen, and scientist Klaus Hochheim.  I’m at the biennial Polar and Alpine Microbiology Conference right now, and there were many sad reactions to the news.  I haven’t had the privilege of working with any of the crash victims, but many here have.  The news resulted in a long discussion over dinner last night on risk in field research.  The question that never seemed to get answered was how much risk is acceptable?

Very often in response to tragedy the answer is determined, with very little discussion, to be zero.  Two cases were presented at dinner where, following non-lethal accidents, unilateral decisions were made to reduce risk to zero.  These two cases are very different from the tragedy on Monday – the accidents occurred during non-essential activities and on ships belonging to UNOLS, the University-National Oceanographic Laboratory System (the US fleet of academic research vessels).  The policy changes from these incidents, however, are indicative of a risk-averse mindset among science managers.  The lack of conversation on what is an acceptable level of risk is troubling.  Advocating for zero risk is easy, but it is impractical, costly, and threatens the quality of research.  A quick look at the incidents in question:

DISCLAIMER – I made an effort to find some documentation on these incidents but LexisNexis didn’t turn up anything useful.  The following is lore that has been passed down through the oceanographic community.

Incident 1: Shark bites girl, girl bites UNOLS

UNOLS vessels used to hold “swim calls” while in tropical waters.  Researchers on long cruises could take a quick dip after weeks in a cramped, hot vessel.  A pretty good perk.  Perks are good for morale.  Morale is good for work.  Some years ago (?) a swim call was held after a vessel had been stationary for some time, and after scraps of food were dumped overboard.  There was a shark.  Someone got bit.  They sued.  UNOLS banned swim calls.

Incident 2:  Get drunk once, never drink again

This incident happened in the 1970s and the details are even sketchier.  UNOLS vessels used to allow alcohol, but after an inebriated crewmember was injured (and sued…) all alcohol was banned.  The US is unique among the major nations that conduct oceanographic research in having dry ships (expeditions on foreign vessels are very popular).

You might say, so what?  In neither case did a ban really reduce the ability of the UNOLS fleet to conduct research.  I’ve even heard from some female colleagues that they favor the alcohol ban – weeks at sea with a majority male crew can be awkward enough without alcohol.  The problem is the imposition of rules disproportionate to the incidents, the mindset that seeks zero risk.  Perhaps this results from the reasonable desire of UNOLS not to be sued by an individual over the consequences of a choice he or she makes.  What is baffling is the rejection of the obvious alternative: inform individuals of the risks of field research, and then hold them accountable for the decisions they make in the field.

At various times I’ve worked as a divemaster, whitewater kayak instructor, and raft guide.  There is a small amount of risk involved in participating in any one of these leisure activities, and every year people die or are seriously injured doing them.  Despite this a sort of equilibrium has been reached – people are informed of the risk, acknowledge it in consent forms,  and participate.  There is no need to eliminate all the risk – we don’t restrict rafting tours to flat sections of river or scuba divers to water they can stand up in.  I’ve never once signed a liability waiver as a researcher, despite working in some very challenging environments.  Instead it was assumed that bureaucracy had rounded all the sharp corners, and that I was free to sue if I stubbed my toe on one they missed.

There is no way to know what policy changes will result from Monday’s incident.  It would be crass to speculate at a time when families and colleagues are coming to terms with the tragedy, and irresponsible when so few details are known.  We are all reminded that risk comes with the privilege of pursuing science outside the office.  I hope that as a community (and for that matter, as a nation) we can learn to approach the concept of risk rationally.


Halocarbons from young sea ice and frost flowers

An interesting paper came out recently from a group at the University of Gothenburg in Sweden.  Working at Svalbard, a common Arctic study site for European researchers, they measured the flux of organic compounds called halocarbons (or organohalides) from young sea ice and frost flowers.  Halocarbons are a class of chemicals defined by one or more halogen atoms (fluorine, chlorine, bromine, or iodine) bound to a chain of carbon and hydrogen atoms.  The simplest halocarbons are the methylhalides, which contain only a single carbon.

Halogens are important reactive chemicals in the atmosphere, playing a role in ozone depletion (at ground level, which is good, and in the stratosphere, which is bad) and cloud condensation.  Halocarbons, especially the lower molecular weight ones like the methylhalides, are a source of halogens to the atmosphere.  When bound up in an organic molecule the halogen atoms are less reactive, and less prone to reacting with large, heavy molecules than when free.  This enables them to leave their site of origin (sea ice, seawater, leaf litter, a tree…).  Once in the atmosphere the light, unreactive halocarbons can be split apart via sunlight-driven reactions, freeing the halogen as a reactive ion.

The concentration of common halocarbons in the atmosphere. Note the strong seasonal cycle for methylchloride, an indication that it is derived from photosynthetic organisms. Image from http://www.esrl.noaa.gov/gmd/hats.

Lots of people are interested in what produces the halocarbons that end up in the atmosphere.  There are many sources, including anthropogenic ones, but a major source is marine phytoplankton and algae.  Like terrestrial plants, these organisms produce oxygen during photosynthesis.  Oxygen, and the energy required to produce it, is dangerous stuff to have in a living cell.  Like gasoline in a car it can do a lot of damage if the stored energy gets released in the wrong way.  The damage oxygen does to cells is called oxidative stress, and cells stockpile antioxidants as a defense against it.  There are also certain enzymes that have antioxidant functions, including a class of enzymes called haloperoxidases that act on hydrogen peroxide (a highly oxidizing byproduct of photosynthesis).  As the name implies, halogen atoms play a role in the reaction catalyzed by haloperoxidases.  The product of the reaction is a halogen that is short an electron, making it particularly reactive.  It can react further with another molecule of hydrogen peroxide regenerating the original halogen ion.  Alternatively it can receive a carbon atom from another enzyme, making it a methylhalide.

The general reaction of haloperoxidase and hydrogen peroxide (Butler and Walker, 1993).

What does all this have to do with frost flowers and young sea ice?  Phytoplankton in the water column are preferentially entrained in young sea ice as the ice forms.  There, close to the light, they continue photosynthesis.  Halocarbons produced by their haloperoxidases are released into the brine network in sea ice rather than into seawater.  As continued ice growth pushes this brine to the surface (some of which ends up in frost flowers), the halocarbons may end up in the atmosphere much more quickly than those produced by phytoplankton in the water column.  The data seem to support this: halocarbon concentrations in young sea ice peak near the surface several days after ice formation (perhaps when the entrained phytoplankton become adapted to their new home and resume photosynthesis).  After several days, however, halocarbon concentrations at the ice surface are greatly diminished, possibly due to the loss of connectivity in the brine veins as the ice temperature decreases.

Bromoform, the most abundant halocarbon measured in the study, peaks close to the ice surface at around day 12 (Granfors et al., 2013).

There is also a possible bacterial source of halocarbons in young sea ice and frost flowers, tying this work to our recent publication in EMIR (Bowman and Deming, 2013).  There we reported the presence of an unusual community of bacteria that we hypothesize are symbionts of phytoplankton trapped in the ice.  Many of these bacteria belong to taxonomic groups known to produce halocarbons using haloperoxidases similar to those contained in phytoplankton.  They do this to alleviate the oxidative stress of their phytoplankton partners, which enables the latter to undergo increased levels of photosynthesis (which ultimately benefits the bacteria).

We have not observed many phytoplankton within frost flowers or at the ice surface (conditions there are probably too hostile for them), but there is an abundance of hydrogen peroxide and other oxidizing compounds produced by sunlight at the ice surface.  Perhaps the bacteria in frost flowers, for the short time they are able to survive, are engaged in their normal task of eliminating these harmful compounds and producing halocarbons.  Unfortunately we have not yet undertaken measurements of halocarbons and other biogenic compounds in parallel with analyses of the frost flower microbial community.  Until we do this we can only speculate!


Compositional vectors revisited

A couple of articles back I posted a down-and-dirty method for calculating compositional vectors for a whole genome.  I’m using compositional vectors in one of my projects, so I reworked the code to make the analysis more rigorous: implementing the suggestion to use Python’s itertools module, shifting the alphabet to amino acid space, limiting the composition to products from open reading frames, and applying the correction from Hao and Qi, 2004.  The correction relies on the calculation of k – 1 and k – 2 length compositional vectors, which are the k1 and k2 sections in the following block.  The k section is the desired k-length vector, and k0 is the normalized vector.
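
For reference, the expected (neutral-evolution) count that the k0 section below computes for each k-mer works out to

expected = [count(first k – 1 letters) × count(last k – 1 letters) / count(middle k – 2 letters)] × [(L – k + 1)(L – k + 3) / (L – k + 2)^2]

where L is the sequence length, and each element of the normalized vector is the relative deviation (observed – expected) / expected.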

#### load some modules we will need ####

import os
import subprocess
import re
import gzip
import itertools
import cPickle
from Bio import SeqIO
from joblib import Parallel, delayed

#### generate all possible kmers ####

## the amino acid alphabet (the "pro" list assumed by the loops below)
pro = ['A','R','N','D','C','E','Q','G','H','I',
       'L','K','M','F','P','S','T','W','Y','V']

k = 5
k1 = k - 1
k2 = k - 2

## k      

pros = list(itertools.repeat(pro, k))
bins = list(itertools.product(*pros))
new_bins = []    
for b in bins:
    b = ''.join(b)
    new_bins.append(b)
bins = new_bins
nmers = open(str(k)+'mers_pro.set', 'wb')                   
cPickle.dump(bins, nmers)
nmers.close()

## k1

pros = list(itertools.repeat(pro, k1))
k1_bins = list(itertools.product(*pros))
k1_new_bins = []    
for b in k1_bins:
    b = ''.join(b)
    k1_new_bins.append(b)
k1_bins = k1_new_bins
k1_nmers = open(str(k1)+'mers_pro.set', 'wb')                   
cPickle.dump(k1_bins, k1_nmers)
k1_nmers.close()

## k2

pros = list(itertools.repeat(pro, k2))
k2_bins = list(itertools.product(*pros))
k2_new_bins = []    
for b in k2_bins:
    b = ''.join(b)
    k2_new_bins.append(b)
k2_bins = k2_new_bins
k2_nmers = open(str(k2)+'mers_pro.set', 'wb')                   
cPickle.dump(k2_bins, k2_nmers)
k2_nmers.close()

Once all possible words have been identified for k, k1, and k2, I count their occurrence in the proteome and then normalize the count.  The code doesn’t show where “name” comes from; in my case I’ve already looped over the files in the directory and generated a list of file prefixes for the proteomes I want to analyze (code not shown).  I append ‘_pro.fasta’ to each prefix to get back to the actual file name.

#### find normalized kmers in proteome ####

def calc_vector(name, bins, k1_bins, k2_bins):
    k1_found_bins = {}
    k1_used_bins = set()
    k2_found_bins = {}
    k2_used_bins = set()
    found_bins = {}
    used_bins = set()

    seqs = name+'_pro.fasta'
    for record in SeqIO.parse(seqs, 'fasta'):
        query = str(record.seq)

        ## k1 and k2

        for i in range(0,len(query)):
            kmer = query[i:i+k1]
            print name, 'k1', i
            if kmer not in k1_used_bins:
                k1_found_bins[kmer] = 1
                k1_used_bins.add(kmer)
            else:
                k1_found_bins[kmer] = k1_found_bins[kmer] + 1  

        for i in range(0,len(query)):
            kmer = query[i:i+k2]
            print name, 'k2', i
            if kmer not in k2_used_bins:
                k2_found_bins[kmer] = 1
                k2_used_bins.add(kmer)
            else:
                k2_found_bins[kmer] = k2_found_bins[kmer] + 1

        ## k

        for i in range(0,len(query)):
            kmer = query[i:i+k]
            print name, 'k', i
            if kmer not in used_bins:
                found_bins[kmer] = 1
                used_bins.add(kmer)
            else:
                found_bins[kmer] = found_bins[kmer] + 1

    ## k0

    norm_bins = {}
    for kmer in found_bins.keys():
        if len(kmer) == k:
            kmer_1 = kmer[0:-1]
            kmer_2 = kmer[1:]
            kmer_3 = kmer[1:-1]
            bigL = len(query)
            kmer_0 = ((k1_found_bins[kmer_1] * k1_found_bins[kmer_2])
            / float(k2_found_bins[kmer_3])) * (((bigL - k + 1) * (bigL - k + 3))
            / float((bigL - k + 2) ** 2))
            kmer_norm = (found_bins[kmer] - kmer_0) / kmer_0
            norm_bins[kmer] = kmer_norm
            print name, kmer, kmer_norm

    ## fill out dictionary with 0 values for unrepresented kmers

    for nmer in bins:
        if nmer not in used_bins:
            norm_bins[nmer] = 0

    with gzip.open(name+'_'+str(k)+'mer_bins.txt.gz', 'wb') as bins_out:
        for each in sorted(norm_bins.keys()):
            print >> bins_out, each+'\t'+str(norm_bins[each])

It takes a little bit of time to calculate the normalized vector for each proteome you wish to analyze.  One way to greatly speed things up is to analyze the proteomes in parallel.  I’m using the Parallel function from the joblib module for this.  Getting it to work right wasn’t straightforward, and the rather kludgy method of printing each compositional vector to a file in the above function was my workaround for not being able to build a dictionary of vectors inside a parallelized loop.  In the loop below I’m generating a compositional vector for each proteome specified in names, starting a new process for each proteome up to the number of CPUs available on the system (n_jobs = -1).

#### run the function in parallel on input files ####
Parallel(n_jobs = -1, verbose = 5)(delayed(calc_vector)
(name, bins, k1_bins, k2_bins) for name in names)

The final piece of the puzzle is to bring the individual vectors together into a single matrix that can be converted to a distance matrix (or whatever else you wish to do with it).  There are a number of ways to do this.  I chose to generate a dictionary of all vectors, with the “name” as key.  The bins are in order (or should be!) in the list that makes up each dictionary value.  It can take a long time to calculate the matrix, so I save the dictionary as a pickle so that I can access it anytime in the future without having to re-calculate anything.  Note that the matrix is transposed; if you write it to a text file be sure to change the orientation.  R and Matlab (I believe) will not read a matrix with 3.2 million columns (that many rows, and many more, are okay)!

#### bring the vectors together and create some permanent ####
#### output                                               ####

final_bins = {}
for f in os.listdir('.'):
    if f.endswith('bins.txt.gz'):
        name = re.split('_'+str(k)+'mer_bins', f)
        name = name[0]
        print name
        with gzip.open(f, 'rb') as bin_file:
            temp = []
            for line in bin_file:
                line = line.rstrip('\n')
                line = line.split('\t')
                bin_id = line[0]
                bin_num = line[1]
                temp.append(bin_num)
            final_bins[name] = temp

with open(str(k)+'mer_normalized_phylogeny_vector_output.p', 'wb') as pickle_out:
    cPickle.dump(final_bins, pickle_out)
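
To get from the pickle to an actual distance matrix, something like the following works (a minimal sketch, assuming numpy and scipy are installed; k must match the value used above):

import cPickle
import numpy as np
from scipy.spatial.distance import pdist, squareform

k = 5  ## must match the k used to generate the pickle

## load the dictionary of normalized vectors saved above
with open(str(k)+'mer_normalized_phylogeny_vector_output.p', 'rb') as pickle_in:
    final_bins = cPickle.load(pickle_in)

## rows are proteomes, columns are kmers; the values were written
## out as text, so convert them back to floats
names = sorted(final_bins.keys())
matrix = np.array([map(float, final_bins[n]) for n in names])

## pairwise Euclidean distances between compositional vectors
dist = squareform(pdist(matrix, 'euclidean'))
print names
print dist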

Top 10 scripting tricks for basic bioinformatics

In the process of preparing some data for publication I needed to revisit some scripts I wrote several months ago, clean them up, and re-run them.  I was pretty amazed at how bad they were, and ended up spending a couple of days rewriting them (necessary?  No… but it felt good).  Many of the changes between the old versions and the new versions came down to the implementation of a relatively small number of tools.  Here are the top ten elements present in scripts I write now that have resulted in huge performance improvements over scripts I wrote, say, a year ago.  I can only hope the pace of improvement continues.  The elements are a mix of Python and shell commands or data structures.  If you’re an experienced hand at writing scripts these are pretty basic tips, but if you’re just starting out you might find them useful.  If you’ve got a life-saving trick of your own, leave a comment.

Number 10: the tab delimited format

I’ll be honest.  The xml format scares me, but not as much as the newick format and some of the other more obscure “standard” data formats that crop up in bioinformatics.  Perhaps with more experience I’ll get over this, but for now nothing makes me happier than the tab-delimited format.  Easy to read, easy to parse.  Even some formats developed by people sophisticated enough to come up with something more complex are tab-delimited (the SAM format, for example).  The three tools I use the most on a day-to-day basis are bash, Python, and R.  With everything tab-delimited they all play nicely together.  I try to keep as much data in this format as possible.

Number 9: head and tail

Not a scripting technique per se, and one of the first shell commands most people learn.  Have a 5 G tab-delimited file with an unknown number of columns?  Use head -X to quickly access the first X lines.  Need a small test file for your next script?  Redirect that same head command to a file and you’re all set.  The same applies to tail, except it returns the last X lines.

Number 8: the redirect

Also one of the first shell techniques most people learn.  Redirects allow you to send files to shell commands and capture output.  For example if you want a file of all the sequence names in a very large fasta, grep and a redirect will do the trick:

fgrep '>' big.fasta > big.fasta.names.txt

Related to the redirect is the pipe.  You can send the output of one command straight to another with the “|” character.  For example if you need to concatenate thousands of fasta files and then compress them, redirecting the compressed stream to a single new file:

cat *.fasta | gzip > combined.fasta.gz

This eliminates the large, uncompressed intermediate file.

Number 7: dictionaries

I didn’t really get dictionaries when I took a class on Python as a new graduate student.  I understood what they did, but not how that was particularly useful.  In the last script I wrote I think I implemented four different dictionaries, mapping gi number to ncbi id, gi to taxonomic id, ncbi id to taxonomy, and finally actual sequence information to taxonomy (or something along those lines).  This multi-step mapping would have been extremely complex and inefficient using a different data structure.  One conceptual breakthrough came when I realized dictionary values could be anything – lists, sets, even other dictionaries.  It’s a great way to organize and quickly access a lot of information.
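
A contrived sketch (all of the identifiers here are made up) shows the idea of chaining dictionaries whose values are other containers:

## hypothetical mappings, for illustration only
gi_to_taxid = {'11111': '100', '22222': '200'}
taxid_to_taxonomy = {'100': ['Bacteria', 'Proteobacteria'],
                     '200': ['Bacteria', 'Firmicutes']}

## values can be lists, sets, or even other dictionaries
for gi in gi_to_taxid:
    taxid = gi_to_taxid[gi]
    print gi, taxid, ';'.join(taxid_to_taxonomy[taxid])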

Number 6: parallel

Parallel computing baffles me a little.  I have about a 50 % success rate implementing parallel cluster computing on various projects using MPI, and a 0 % success rate implementing the Python parallel computing modules.  The GNU “parallel” command is a cozy warm blanket in the cold hard world of parallel computing.  If you have multiple CPUs on the same motherboard (for example our lab’s MacPro workstation) and a job you can easily split into many small jobs, parallel allows simultaneous execution of as many jobs as you have cores.  It works great on loops in bash shells and also on multiple commands in a bash shell script.  I’ve found the simplest way to execute parallel is to use Python to split my job and create a bash script with one line per job (each line might contain multiple commands).  Redirecting the script into parallel results in parallel execution.  Parallel is smart enough to capture and sort output from all the threads, even if it’s going to the same place.

parallel < my_awesome_script.sh > my_dissertation.txt

Number 5: grep

Another basic shell command.  Grep, fgrep, and zgrep are magic commands that can extract critical lines of data based on pattern matching.  If you aren’t using a nested file format (if you are using tab-delimited or similar) the grep commands can quickly subset a massive file to just those lines you need.  It’s worth looking at all the options available for these commands, which make them quite powerful.

Number 4: “vectorized” commands

I’m not sure if “vectorized commands” is the right term, but I’ve heard that applied to R which discourages the use of loops and relies very heavily on commands that iterate intrinsically.  Python tends to rely more heavily on loops, but it has a number of these intrinsic iterators that are much, much faster than loops.  For example to match a string to a list of strings you might be tempted to use a loop:

for s in some_list:
     if my_string == s:
          do something

However the “in” command will perform much faster:

if my_string in some_list:
     do something

Number 3: gzip

Bioinformatics is big files.  You can’t escape from them.  The gigabytes keep growing until they’re terabytes, and then I don’t really want to think about what happens.  And of course you have to move all these files from computer to computer and try to do things with them.  Python (using the gzip module) and R (natively) both play well with gzip.  So does zgrep.  I try to keep everything larger than a couple of megabytes in the gzip format.  This saves a ton of space and of course file transfers are faster.
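
A minimal sketch (the file name is a placeholder) – reading a gzipped file line by line looks just like reading a plain text file:

import gzip

handle = gzip.open('big_table.txt.gz', 'rb')
for line in handle:
    print line.rstrip('\n')    ## lines are decompressed on the fly
handle.close()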

Number 2: with

With is another Python statement that I didn’t really get at first.  Then came the day that I needed to work through a many-gigabyte file in Python and couldn’t hold it in memory.  The “with open(…) as handle:” construction opens the file and guarantees it gets closed when the block ends; iterating over the handle then reads one line at a time, so the whole file never has to sit in memory.  Downside, it adds one more indentation level to that block of your code.  Upside, you can add multiple “open as” segments to the with line:

with open('file_1.txt', 'r') as file1, \
open('file_2.txt.gz', 'rb') as file2, \
open('output.txt', 'w') as output:
     for line in file1:
          do lots of stuff

Number 1: the set

The single most useful scripting trick I’ve learned over the last year is the set.  Sets are containers of unique, unordered elements.  Searching across a set is lightning fast compared to searching across a list.  Sets convert to and from ordered lists easily enough if ordering is important and duplicate entries are not.  I almost never search across Python lists anymore.  Sets have allowed me to create some scripts that will run overnight that would have taken weeks or months with lists.
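
A toy example (with real data the difference is dramatic):

queries = [str(i) for i in range(1000000)]
query_set = set(queries)           ## one-time conversion

print '999999' in queries          ## linear scan of the list
print '999999' in query_set        ## near-instant hash look-up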

Happy scripting!


The future of phylogeny

Yesterday I went to an inspiring informal seminar by Bailin Hao from Fudan University in Shanghai.  Bailin presented work by his group to establish a new method for determining microbial phylogeny.  Inferring phylogeny is currently a massive headache in the field of microbiology.  Traditionally phylogeny was determined by aligning a marker gene (typically the 16S rRNA gene) from a set of bacteria, and inferring relatedness within the set from differences in these alignments.  These differences are the result of insertions, deletions, and base changes accumulated over time as the gene evolved independently in each analyzed strain.

Although this approach has served the study of microbial evolution and ecology well, it has some serious shortcomings.  First, phylogenetic inference from gene alignment is hard.  A lot of work has gone into developing sophisticated statistical models that predict how likely a given mutation is to occur (and thus where gaps should appear in an alignment), but even the best model can’t really capture what’s occurring in nature.  No alignment is perfect, and even imperfect alignments take a lot of time and computer power to calculate.  Second, genomes don’t always evolve in a predictable vertical fashion, with a daughter genome acquiring all the genes of the parent genome and perhaps a beneficial mutation or two.  Single-celled organisms often pick up genes they encounter in the environment.  If beneficial, those genes can find a permanent home in the genome.  There is no guarantee that the source organism of a “horizontally transferred” gene is closely related to the recipient, thus a phylogenetic inference based on a horizontally transferred gene will be incorrect with respect to the whole genome.  Third is the question of what taxonomy means anyway.  If a bacterial genome is (hypothetically) 30 % horizontally acquired genes, and has a radically different phenotype than its closest relatives as a result of these transfers, what is the value of determining traditional taxonomy?

Enter compositional vector based phylogeny.  This method is just one of a new generation of sequence comparison approaches that use kmers to determine similarity.  A kmer is a word (of a length set by the investigator) created from a sliding window moving along the sequence in question.  For example the sequence AGCTCCGGC would have the 4 letter kmers AGCT, GCTC, CTCC, TCCG, CCGG, and CGGC.  The more closely related two sequences are, the more kmers they will have in common.  The kmers in common can be counted without aligning the two sequences, thus eliminating one of the most time and computer intensive parts of sequence comparison.
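
In Python the sliding window is a one-liner; using the toy sequence above:

seq = 'AGCTCCGGC'
k = 4
print [seq[i:i+k] for i in range(len(seq) - k + 1)]
## ['AGCT', 'GCTC', 'CTCC', 'TCCG', 'CCGG', 'CGGC']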

In the compositional vector approach to phylogeny a complete microbial proteome is converted to a vector of all possible amino acid kmers of a specified size.  For a five letter kmer the possible kmers are AAAAA through VVVVV, with 3,200,000 possible words (20 ^ 5).  Tallying the number of occurrences of each possible kmer in the proteome produces a vector.  Converting two or more vectors to a distance matrix provides a measure of phylogenetic distance.

What really struck me was the relative simplicity of this method.  In reality there are some important additional steps, namely that the vectors need to be normalized to account for neutral evolution, but any investigator can use this method without sophisticated tools.  Python’s probably not the best language for this, but the concept can be explored with a simple Python script.  The script below will calculate a matrix of compositional vectors for some number of proteomes, WITHOUT the important normalization step.  It’s not fast, but it can calculate the compositional vectors for four proteomes (in the example obtained from UniProt) in a few minutes’ time.

import cPickle
import os

## if you haven't already done so generate all possible 
## five letter words

if '5mers.set' not in os.listdir('.'):
    aa = ['A','R','N','D','C','E','Q','G','H','I',\
    'L','K','M','F','P','S','T','W','Y','V']   
    bins = set()    
    temp = [0,0,0,0,0]

    for residue_1 in aa:
        temp[0] = residue_1
        for residue_2 in aa:
            temp[1] = residue_2
            for residue_3 in aa:
                temp[2] = residue_3
                for residue_4 in aa:
                    temp[3] = residue_4
                    for residue_5 in aa:
                        temp[4] = residue_5
                        temp_string = ''.join(temp)
                        print temp_string
                        bins.add(temp_string)

    fivemers = open('5mers.set', 'wb')                   
    cPickle.dump(bins, fivemers)
    fivemers.close()

else:
    fivemers = open('5mers.set', 'rb')
    bins = cPickle.load(fivemers)
    fivemers.close()

## set up a matrix to hold kmer tally
vec_matrix = []
for b in sorted(bins):
    l = [b]
    vec_matrix.append(l)

## concatenate proteins in multifasta to single string

for seq in ['34H_proteome.fasta', \
'agrobacterium_tumefaciens.fasta', \
'sinorhizobium_medicae.fasta', \
'sinorhizobium_meliloti.fasta']:    
    with open(seq, 'r') as fasta:
        query = ''
        for line in fasta:
            if line.startswith('>') == False:
                line = line.rstrip('\n')
                query = query+line

    ## search proteome for all 5 letter kmers, and tally occurrence

    found_bins = {}
    used_bins = set()   

    for i in range(0,len(query)):
        kmer = query[i:i+5]
        print i
        if kmer not in used_bins:
            found_bins[kmer] = 1
            used_bins.add(kmer)
        else:
            found_bins[kmer] = found_bins[kmer] + 1

    ## fill out dictionary with 0 values for unrepresented kmers

    for fivemer in bins:
        if fivemer not in used_bins:
            found_bins[fivemer] = 0

    for i,r in enumerate(sorted(bins)):
        vec_matrix[i].append(found_bins[r])

## print matrix as tab delimited text file

output = open('phylogeny_vector_output.txt', 'w')        
for row in vec_matrix:
    for i,e in enumerate(row):
        if i+1 != len(row):
            print >> output, str(e)+'\t',
        else:
            print >> output, str(e)   
output.close()

Here’s what the output looks like:

AAAAA	12	90	125	133
AAAAC	0	1	6	3
AAAAD	2	12	21	28
AAAAE	5	26	53	59
AAAAF	3	21	25	30
AAAAG	8	42	62	69
AAAAH	0	6	9	11
AAAAI	8	30	44	46
AAAAK	2	27	20	26
AAAAL	16	50	73	75

The whole matrix is ~ 50 MB; for anything larger than a few proteomes it would be advisable to compress the output (cool trick – I only recently learned that R will read in a .gz file with no additional prompting, which has saved me a ton of storage space of late).  To look at the relatedness of the strains you could use R or any other data analysis platform.  In this case a dendrogram derived from a distance matrix of the original data should illustrate compositional relatedness.  Again, I stress that this is an overly simplistic interpretation of the actual method.  In R…

d <- t(as.matrix(read.table('phylogeny_vector_output.txt',
                            row.names = 1)))
d_dist <- dist(d, "euclidean")
d_clust <- hclust(d_dist)
plot(as.dendrogram(d_clust))

Voilà, a dendrogram of relatedness!  V2-5 correspond to the order of the original composition matrix, or Colwellia psychrerythraea, Agrobacterium tumefaciens, Sinorhizobium medicae, and Sinorhizobium meliloti.  As we would expect Colwellia, a Gammaproteobacteria, clusters separately from the other strains, all Alphaproteobacteria.  The two Sinorhizobium strains cluster closest.  For more about this approach, and details on the full method, check out Li, Xu, and Hao, 2010 in the Journal of Biotechnology.  The Hao lab has an excellent website in beta where all available genomes are represented on an interactive tree: http://tlife.fudan.edu.cn/cvtree/.

Dendrogram of Euclidean distance based on complete proteome compositional vectors for Colwellia psychrerythraea (V2), Agrobacterium tumefaciens (V3), Sinorhizobium medicae (V4), and Sinorhizobium meliloti (V5).


Some things should be easy, and they are…

**UPDATE**
Got it.  You have to set the BLASTDB variable for this to work and place the taxonomy files in this same location.  I just did it in my .bash_profile:

BLASTDB="/path/to/databases"
export BLASTDB

**UPDATE**
The original motivation for this was to obtain taxonomy data for BLAST search results.  Sometime in the last few months this feature actually slipped into the BLAST+ package.  I can’t actually get the feature to work, but perhaps you can!  It requires the installation of two special taxonomy files found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz.  Typically cryptic instructions can be found on the command-line BLAST manual page at http://www.ncbi.nlm.nih.gov/books/NBK1763/, along with the necessary flags to get the taxonomy to appear (ha!) in the table format output.  I suspect that my inability to get this to work stems from my dislike for mucking with the BLAST profile file.  Instead of setting a default database location I use the complete path in my calls.  Regardless of where I locate the taxonomy database, BLAST can’t seem to find it 🙁

****

A couple of months ago I wrote a post on the difficulty of parsing the gi_taxid_nucl.dmp.gz file that maps sequence gi numbers to their NCBI taxonomy number (taxid).  With a taxid it is possible to obtain the actual taxonomy of (some) sequences in Genbank.  I proposed a couple of elaborate solutions, some of which were improved upon in the comments section by other contributors.  One particularly good suggestion was to use primary keys in a SQL database for fast look-ups.
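
For completeness, here is roughly what the SQL approach looks like using Python’s built-in sqlite3 module (a sketch only; the file names and the example gi number are placeholders):

import gzip
import sqlite3

conn = sqlite3.connect('gi_taxid.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS gi_taxid (gi INTEGER PRIMARY KEY, taxid INTEGER)')

## load the dump file once; the primary key doubles as an index
with gzip.open('gi_taxid_prot.dmp.gz', 'r') as dmp:
    rows = (line.rstrip('\n').split('\t') for line in dmp)
    cur.executemany('INSERT OR IGNORE INTO gi_taxid VALUES (?, ?)', rows)
conn.commit()

## individual look-ups are then effectively instant
cur.execute('SELECT taxid FROM gi_taxid WHERE gi = ?', (12345,))
print cur.fetchone()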

In my latest analysis I have ~ 50,000 gi numbers to map to taxids and had to revisit this issue.  After some tinkering I realized I’d missed a very obvious and reasonably efficient way to extract the relevant lines from the gi_taxid_nucl/prot.dmp.gz file.  If you have only a few gi numbers to map, stick with the SQL or grep solutions.  If you have many, give this method a try.  Making use of Python’s with statement it is possible to very quickly check each line of the .dmp.gz against a known collection of gi numbers contained in a set.  I only recently discovered Python sets, and they have already saved me numerous times.  Searching against items in a set is orders of magnitude faster than searching against items in a list.

# assuming that gis is a set of gi numbers (as strings)...

import gzip

gi_taxid_output = open('small_gi_taxid.txt', 'w')

with gzip.open('gi_taxid_prot.dmp.gz', 'r') as dmp:
    found = 0
    for line in dmp:
        line_split = line.split('\t')
        print line_split[0], found
        if line_split[0] in gis:
            found = found + 1
            print >> gi_taxid_output, line,

gi_taxid_output.close()

Using this method a search for 50,000 gi numbers takes ~ 1 hour for gi_taxid_nucl.dmp.gz and about 30 minutes for gi_taxid_prot.dmp.gz.  The grep method, even utilizing the GNU parallel command for an 8-fold speed increase (on an 8-core machine), would have taken at least overnight and produced a messy gi_taxid file with many erroneous hits.


A particularly bad idea

I haven’t written many posts on this blog that are political in nature; in fact I think the only political topic that’s come up is the need for new investment in the US icebreaker fleet.  A particularly troubling bill, titled the “Quality in Research Act”, is under development in Congress, however, and if passed it will do anything but guarantee high quality in our nation’s research.  As a working scientist it is difficult not to have some opinions about it.

A number of blogs have given the proposed act a pretty complete treatment, in particular this post on AmericanScience.  Rather than speculate on the motivation of the bill’s author (Lamar Smith (R-TX); naivety?  anti-science philosophy?), and rather than repeat what’s being said in other forums, I’ll just add a couple of points that seem to be missing from the discussion.  According to AmericanScience the bill will include new criteria for National Science Foundation (NSF) funded projects.  To meet the proposed criteria a project:

1.  Is in the interest of the United States to advance the national health, prosperity, or welfare, and to secure the national defense by promoting the progress of science.

2.  Is the finest quality, is ground breaking, and answers questions or solves problems that are of utmost importance to society at large; and

3.  Is not duplicative of other research projects being funded by the foundation or other Federal science agencies.

All three of these criteria have some deep flaws as written.  I’ll address the second and third criteria, which are the more troubling.  But first, to establish some context, what exactly is the National Science Foundation and how much money does the federal government spend on research?  NSF is one of several government or quasi-government entities that provide Federal funds for research.  It is not the largest such entity; the National Institutes of Health (NIH) and the research branches of the Department of Defense (DOD) have much larger budgets.  NASA also has a much larger budget, but the amount spent on actual research is a bit less than what NSF invests.  NOAA, USGS, DOE, EPA and other agencies also conduct research, but this research tends to be mission-oriented and primarily conducted in-house.

The Obama administration’s request for NSF for 2014 amounts to $7.6 billion, a pretty small fraction of the federal budget.  By comparison the NIH request for 2014 is $31.33 billion.  I don’t begrudge NIH and NIH funded investigators a penny of that amount, but the difference highlights a flaw in how the Federal government thinks about science.  The mission of NIH is fundamentally different than that of NSF; the majority of NIH funds goes to applied research – concepts that had a genesis in basic research but that have matured to the point where they can begin to solve people’s problems.  NSF is the primary source of funding for that genesis – it is the single largest source of Federal funding for basic research.

This gets right to the heart of what’s wrong with the second point in the proposed criteria.  It suggests that the authors hope to eliminate the concept of basic research by requiring projects to solve problems already identified as having the utmost importance to society.  I posit that many of the currently identified problems (and/or currently employed solutions) faced by society would not have been identified under this constraint.  A strength of the current process of peer review within NSF (and the other federal research agencies) is that one expert who sees a possible, unsolved piece of a potential problem can decide to pursue it further, if they can convince a jury of their peers and a program officer (typically a senior scientist on loan to NSF from academia) that the concept has merit.  This is no easy task; something like 10 % of all grant proposals to NSF actually get funded, but a well-made case stands a chance.

In the worst-case and rare scenario where NSF allocates funds to a duplicative or flawed project (a typical NSF project might be $300-400,000 over a three year period), these public funds are still not going to waste.  Unlike in the private sector there is no accumulation of wealth in academia.  Federal money flowing into academia flows right back out (typically to the private sector, but also right back to Federal coffers).  Of $400,000 allocated to a university professor for a 3 year grant, 50 % (varying by institution) goes to the host university as overhead.  With these funds the university supports capital projects (providing jobs), hires staff (more jobs), and improves the educational experience of its students (so they can get jobs).  Since the funded professor is unlikely to have time to actually conduct the research, they will hire a graduate student (a job, at less than the annual earnings of a Starbucks barista) and probably employ a couple of undergraduates part time.  The graduate student will need materials and equipment to carry out the research, which funnels a portion of the NSF investment straight to the private sector.  If there’s anything left from that original $400,000 the professor and graduate students might present their findings at a conference (more private sector – US airlines only, by existing rules!), and the professor might actually pay themselves some salary (they’ve been busy teaching, advising, conducting outreach, and otherwise supporting their institution while the graduate student carries out the project – but they are typically only paid part time for these activities).

That got a little longer than I originally intended, so I’ll only raise a quick objection to the third point in the proposed criteria.  It makes no sense to enforce a rule against duplicate research, for two reasons.  First, the scientific community is quite good at policing itself on this account.  The reviewers on an NSF proposal review panel are working scientists.  Their own grants are competing for the same limited pool of funds.  Does Lamar Smith really think that they aren’t going to be harsh on a proposed project that would unnecessarily duplicate work already being funded?  Projects do need to overlap, and problems do need to be probed from different angles by different research groups.  It is probably not clear to someone outside a given field, however, when two projects are complementary and when they simply duplicate work.  So why not leave this assessment to the experts on a review panel?

Clearly there is a lot that the scientific community can do better, in terms of how it conducts research and how it holds its members accountable.  There is a lot of momentum in the community to improve these things.  Despite these shortcomings science has done a pretty good job, fostering a system of (reasonably) open exchange and a system of evidence-based progress that is unparalleled elsewhere in society.  I can’t resist the barb that it’s a bit presumptuous for Congress to question this system, at least before they develop one that works equally well…


Entry-level computers for bioinformatics

When I started graduate school I knew absolutely nothing about computing on anything more high-performance than a laptop.  I assumed that clusters and servers were exclusive to large, well-funded labs.  Access to these items is a huge limiting factor in almost any field of research.  This is certainly the case in microbiology, where it is increasingly difficult to close the loop on an interesting story without volumes of sequence data.  At the time (and for a long time after) I was indignant at this seeming injustice – only large labs with large budgets had access to the large computers required for the most compelling investigations (the results of which enable additional funding…).  Over time I’ve slowly come to appreciate that this is not the case.  A high-performance computer is cheaper than many items deemed essential to a microbiology lab (-80 C freezer (~ $10 K), microscope (~ $20 K)).  I suspect there are other students out there with similar misconceptions, so here is a short compilation of things I’ve learned regarding research computers.  Imagine that you are a new faculty member, or a postdoc with a little fellowship money, looking to purchase a computer for basic bioinformatics (e.g. genome assembly and annotation).  What will you have to spend?

A necessary consideration is how you choose to use a computer in a research setting.  Our lab’s “server” is a MacPro workstation (more on that shortly).  After some trial and error I settled into an arrangement where I work almost exclusively from my laptop.  I use ssh and scp to access the server, and almost never sit in front of it.  The rest of this post assumes that the purchaser works in a similar fashion.  It’s a matter of personal style; other lab members like to park in front of the monitor (I’d rather park in the coffee shop, or at least my office).  Unlike almost everyone else in the world of bioinformatics my laptop is a PC.  Again, a matter of style.  It works great for me.  The point is that it works almost seamlessly with the MacPro, and with all other Unix/Linux servers.  It really doesn’t matter what you connect with; the server is the important thing.  It’s worth bringing this up because, as a graduate student, I’d rather have a cheap laptop and a decent remote workstation than a nice laptop (of course in an ideal world you would have both).  If you can’t ask your advisor for both, ask for the workstation…

Our MacPro was purchased back in 2008, for all the right reasons.  Next-generation sequencing, which generates orders of magnitude more data than older methods, had just been introduced.  A Mac with 8 G of RAM and 8 cores seemed like more than enough computer.  Nobody in either of the labs using the computer knew much about Unix or Linux, so it didn’t seem like there was much of a choice.  Unfortunately in this decision a cardinal rule of computers was violated – a computer deemed sufficient for the job you are purchasing it for is likely to be insufficient for the next job.  It’s hard to justify spending money on additional resources without a clear use for them, but it just has to be done in some cases.  When I started working with next-generation data I upped the RAM to 24 G and added an additional 2 T HD.  These upgrades make the MacPro suitable for 75 % of the tasks I need to perform on a day-to-day basis, but the other 25 % is kind of essential…

The top of the line MacPro today, with 8T of storage, 12 cores, and 64 G of RAM will run you around $6,600 (tower only).  For a long time I assumed this was the only way to go – more capable workstations must be prohibitively more expensive, right?  Since there’s a lot that you can’t do with only 64 G RAM and 12 cores I assumed that that work was the sole purview of well funded labs.  Sometime back I explained this dilemma to a colleague in our geology department who deals with large geospatial datasets, and who clued me in on the price of computers.

Macs are friendly, but some friends can be bossy and expensive to hang out with.  In this case it takes a bit of tinkering to make a Mac work well as a Unix server.  If that’s how you’re using it then you aren’t using the features that you’ve paid a lot of money for – namely the Mac OS GUI front end.  True Unix/Linux workstations are comparatively cheap.  For ~ $5 K I found two local companies that could provide a server-grade tower with 128 G of RAM (minimal for many applications, but expandable to 512 G) and two 8-core Intel Xeon processors (16 cores total).  There’s plenty that such a machine can’t do, but it’s not a bad start.

I know that other groups have migrated to Amazon EC2 for research computing, but there are clear downsides.  Amazon “instances” are expensive, and limited to 64 G of RAM.  If you aren’t a computer expert you could spend a lot of money just trying to figure out how to make things work.  It also seems that data exploration would be limited by cost, and by the slight difficulty inherent in setting up instances.  Most research institutions have some kind of high-performance computing environment; I’ve used the University of Washington’s Hyak cluster for a lot of my work.  This environment is great when many processors are needed, but, as with Amazon EC2, it is expensive and memory limited.  I’m pretty new at this and just getting a sense of what’s out there.  If you’ve had a good or bad experience with any of this technology leave a comment!


Video output from R script

As an offshoot from one of my dissertation projects I’m developing a simple model to explore protein evolution.  The model takes an amino acid sequence and mutates it, selecting for mutations that result in improvement for one of several measurable protein parameters (flexibility, hydropathy, aliphatic index, isoelectric point, or aromaticity).  I’m using it to explore how selection for one parameter results in unintended changes to the other parameters.  I implemented the model in Python but I’m using R for data analysis and plotting.  For each run I have several hundred measures of the five different protein parameters.  There are a few ways to visualize this, but since the dataset represents a temporal progression it makes sense to create an animation of the output.  After some exploration this is the method that I came up with.

Assuming that you have some table of data (evol_param_d_matrix_1 in this example) in R for which a row represents a single time point, you can generate a figure (I chose a barplot) for each time point as such:

jpeg(filename = 'animations/Rplot%03d.jpg')
for (i in 1:length(evol_param_d_matrix_1[,1])){
  barplot(evol_param_d_matrix_1[i,],
          ylim = c(-0.5, 1.5),
          axisnames = T
          )
}
dev.off()

Note that it is extremely important to specify ylim (and xlim, if that axis might vary for your data) to prevent the scale of the plot from changing from image to image.  The above code saves each plot (one per row in evol_param_d_matrix_1) as a jpeg to the directory “animations”, named sequentially according to the Rplot%03d.jpg pattern.  To convert this sequence of still images to video I used the free tool (for Linux/Mac) ffmpeg.  For Mac, install it using Homebrew.  I called ffmpeg using:

ffmpeg -r 15 -qscale 1 -i 'Rplot%03d.jpg' test.mp4

That command tells ffmpeg to use 15 frames per second, with a quality of “1” (the highest quality), on all jpegs matching the given naming pattern.  I first tried to use Rplot*.jpg, but this ordered the frames alphanumerically rather than numerically.  The output file is of course test.mp4; changing it to test.avi would instead output in that format.  Neither tee nor a plain redirect captured ffmpeg’s screen output to a logfile, which would have been useful for the initial troubleshooting – ffmpeg writes its status messages to stderr rather than stdout, so capturing them requires a stderr redirect (2>).

The ffmpeg documentation is, unfortunately, rather dense and written for people with a lot more audio/visual background than I have.  It looks like ffmpeg is a commonly used tool however, so there are quite a few easy(ish) to follow tutorials online.  Here’s the final product:

ffmpeg animation of barplots produced in R.