DIAMOND – A game changer?

DIAMOND search speed over BLASTX

Figure from Buchfink et al. 2014. The figure shows the speed up of DIAMOND over BLASTX and an alternate new-generation aligner known as RAPSearch2. Metagenomic datasets from the Human Microbiome Project were aligned against the NCBI nr database.

A very exciting paper was published in Nature Methods a couple of weeks ago by Benjamin Buchfink and colleagues: Fast and Sensitive Protein Alignment Using DIAMOND. The paper debuts the DIAMOND software, touted as a much-needed replacement for BLASTX. BLASTX has been a bioinformatics workhorse for many years and is (was) the best method to match a DNA sequence against a protein database. BLASTX worked well in the era of Sanger sequencing. However, with routine sequencing runs today producing many gigabytes of data (hundreds of millions of reads), BLASTX is woefully inadequate. I once attempted to annotate a relatively small metagenome using BLASTX against the NCBI nr database on a high performance cluster of some size and couldn’t do it. It was theoretically possible, but the resources required were out of alignment with what I stood to gain from a complete annotation. I pulled the plug on the experiment at around 25 % complete to avoid getting blacklisted.

There are faster alignment (i.e. search) algorithms available, for example the excellent BWA DNA-DNA aligner, but they don’t quite do what BLAST does. BLAST is valuable for the same reason that it is slow: it’s sensitive. Another way of thinking about this is that BLAST has good methods for aligning dissimilar sequences and can therefore return a more distant (but still desired) match. Many sequences queried against, for example, the NCBI nr database don’t have a close match in that database. The more dissimilar the query is from a candidate homologue, the more calculations need to be performed for the aligner to propose a biologically plausible alignment. This is bad enough when aligning two DNA sequences. Because the amino acid alphabet is larger than the DNA alphabet, comparing protein sequences is even more computationally intensive.

The developers of DIAMOND used improved algorithms and additional heuristics to build a better aligner. I’m not going to attempt a detailed description of the algorithms (which I would certainly botch), but the paper refers to three modifications made to the BLAST concept that result in a huge speedup for DIAMOND. First, DIAMOND uses an optimized subset of positions (a seed) from the query and reference sequences to find candidate matches. The subset is described by the seed weight and shape. Second, the aligner employs something called double indexing, an improved method for storing seed positions in which both the query and the reference sequences are indexed. Finally, the aligner relies on a reduced amino acid alphabet of only 11 letters, each standing for a group of similar amino acids.
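
To make the reduced-alphabet idea concrete, here is a minimal shell sketch that collapses a protein sequence onto 11 letters with `tr`. The grouping used, [KREDQN] [C] [G] [H] [ILV] [M] [F] [Y] [W] [P] [STA], is the one understood to be given in the paper's Methods; treat it as an assumption and check it against the paper. The function name `reduce_alphabet` and the choice of K, I and S as group representatives are hypothetical and not part of DIAMOND.

```shell
# Collapse the 20 amino acids onto 11 letters (assumed grouping, see above).
# KREDQN -> K, ILV -> I, STA -> S; C, G, H, M, F, Y, W and P map to themselves.
# reduce_alphabet is a hypothetical helper, not a DIAMOND command.
# Input is expected in upper case; other characters pass through unchanged.
reduce_alphabet() {
    tr 'REDQNLVTA' 'KKKKKIISS'
}

# Two made-up peptides that differ at six of ten positions...
printf 'MKTAYIAKQR\nMRSAYLTRNK\n' | reduce_alphabet
# ...both come out as MKSSYISKKK in the reduced alphabet.
```

The point of the sketch is only that sequences differing by conservative substitutions can look identical once reduced, so an exact-match seed lookup in the smaller alphabet can still find them. How DIAMOND combines this with seed shape and weight is described in the paper, not here.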

So how fast is DIAMOND? Really, really fast. The paper describes some basic benchmarks. I tried it out on a 12-core Linux box. I did not assess accuracy, but for a basic search it was everything it promised to be and easy to use (what is bioinformatics coming to?). The pre-compiled Unix executable worked straight out of the box and the DIAMOND developers have kindly copied the BLASTX command structure. To try it out I aligned the metagenome described above against the Uniref-90 database as follows:

diamond makedb --in uniref90.fasta -d uniref90 &&
date > start_stop.txt
diamond blastx -d uniref90 -q 44_trim.mates.fasta -o diamond_test.txt
date >> start_stop.txt
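
The `date` calls above bracket the search, but the elapsed time still has to be worked out by hand from the two timestamps. A minimal sketch of an alternative, assuming a `date` that supports `+%s` (GNU and BSD versions do); `format_elapsed` and `time_command` are hypothetical helper names, not part of DIAMOND:

```shell
# Hypothetical helpers for timing a long-running command; not part of DIAMOND.

# Format a number of seconds as "Nm SSs".
format_elapsed() {
    printf '%dm %02ds\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# Run the given command, then report its wall-clock time on stderr.
# The command's own exit status is passed through.
time_command() {
    tc_start=$(date +%s)
    "$@"
    tc_status=$?
    tc_end=$(date +%s)
    echo "elapsed: $(format_elapsed "$(( tc_end - tc_start ))")" >&2
    return "$tc_status"
}

# Example (commented out because it needs DIAMOND and the input files):
# time_command diamond blastx -d uniref90 -q 44_trim.mates.fasta -o diamond_test.txt
```

The shell's built-in `time` keyword gives similar information with less typing; the wrapper is only useful if the elapsed time should be formatted or logged in a particular way.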

The metagenome contains just over 12 million Illumina sequence reads and this (slightly old) version of Uniref-90 contains just over 15 million sequences. That’s a lot smaller than nr, but still pretty big. The DIAMOND default is to use the maximum number of cores available; on this hyperthreaded system it recognized 24. The whole alignment took only 17 minutes and never used more than 16 GB of memory. This is such a large speedup over BLASTX that I’m having a hard time wrapping my head around it. There’s no way that I’m aware of to estimate how long a similar BLASTX search would take, but I think it would be weeks. It’s hard to convey how exciting this is. DIAMOND may have just eliminated one of the biggest analytical bottlenecks for environmental sequence analysis (to be replaced by a larger one, I’m sure).

4 Responses to DIAMOND – A game changer?

  1. Will Nelson says:

    I haven’t heard much subsequent discussion of this program. What is your sense of people’s thinking about it? Its speed is obviously outstanding, and based on their data and on the way the program works it doesn’t seem like sensitivity should be hugely compromised. However, in my tests it does seem considerably less sensitive. I run a small online annotation service which I am currently converting to use Diamond, since it is as far as I can tell the only program currently capable of handling the NR database.

    • Jeff says:

      I agree that there hasn’t been a lot of discussion, although they have netted 41 citations – not bad for a paper that’s been out less than a year. I also haven’t done a real side by side comparison yet with other programs of greater sensitivity. How much have you played around with the different options/parameters?

      • Will Nelson says:

        I haven’t done any really systematic testing, but I have re-run a 78Mb plant transcript set which we previously annotated to Uniprot, using blast. We previously found hits at 1E-10 for 61% of the sequences, whereas with Diamond vs. the NR database I get only 41%, and that is at 1E-3 (same result at 1E-5). Of course it took 78 minutes, compared to a week with Blast.
        I also tried the --sensitive setting, which only raised the percentage to 42%, while taking 3x as long.

        It makes sense to me that the longer seeds of Diamond would be rather less sensitive than blast, and also less than programs like Ublast, or my own Fastannot program, which use 5mer seeds. However for the moment Diamond seems like basically the only game in town for aligning to NR. Am I missing anything?

        • Jeff says:

          Have you played around with LAST? I have not, but it might be of interest. Other than that I think you’re right, DIAMOND might be the best bet for large scale annotations. I try to avoid the whole mess by doing highly targeted annotations whenever possible, or for metagenomes annotating assembled contigs and then mapping.
