Somethings should be easy, and they are…

**UPDATE**
Got it.  You have to set the BLASTDB variable for this to work and place the taxonomy files in this same location.  I just did it in my .bash_profile:

BLASTDB=”/path/to/databases”
export BLASTDB

**UPDATE**
The original motivation for this was to obtain taxonomy data for blast search results.  Sometime in the last few months this feature actually slipped into the BLAST+ package.  I can’t actually get the feature to work, but perhaps you can!  It requires the installation of two special taxonomy files found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz.  Typically cryptic instructions can be found on the command line blast manual page at http://www.ncbi.nlm.nih.gov/books/NBK1763/, along with the necessary flags to get the taxonomy to appear (ha!) in table format output.  I suspect that my inability to get this to work stems from my dislike for mucking with the blast profile file.  Instead of setting a default database location I use the complete path in my calls.  Regardless of where I locate the taxonomy database blast can’t seem to find it 🙁

****

A couple of months ago I wrote a post on the difficulty of parsing the gi_taxid_nucl.dmp.gz file that maps sequence gi numbers to their NCBI taxonomy number (taxid).  With a taxid it is possible to obtain the actual taxonomy of (some) sequences in Genbank.  I proposed a couple of elaborate solutions, some of which were improved upon in the comments section by other contributors.  One particularly good suggestion was to use primary keys in a SQL database for fast look-ups.

In my latest analysis I have ~ 50,000 gi numbers to map to taxids and had to revisit this issue again.  After some tinkering I realized I’d missed a very obvious and reasonably efficient way to extract the relevant lines from the gi_taxid_nucl/prot.dmp.gz file.  If you have only a few gi numbers to map stick with the SQL or grep solutions.  If you have many give this method a try.  Making use of Python’s with statement it is possible to very quickly search each line of the .dmp.gz against a known collection of gi numbers contained in a set.  I only recently discovered Python sets, and they have already saved me numerous times.  Searching against items in a set is orders of magnitude faster than searching against items in a list.

# assuming that gis is a set of gi numbers...
import gzip
gi_taxid_output = open('small_gi_taxid.txt', 'w')
with gzip.open('gi_taxid_prot.dmp.gz', 'r') as dmp:
    found = 0
    for line in dmp:
        line_split = line.split('\t')
        print line_split[0], found
        if str(line_split[0]) in gis:
            found = found + 1
            print >> gi_taxid_output, line

Using this method a search for 50,000 gi numbers takes ~ 1 hour for gi_taxid_nucl.dmp.gz and about 30 minutes for gi_taxid_prot.dmp.gz.  The grep method, even utilizing the gnu parallel command for an 8 fold speed increase (on an 8 core machine), would have taken at least overnight and produced a messy gi_taxid file with many erroneous hits.

16296 Total Views 4 Views Today
This entry was posted in Research. Bookmark the permalink.

6 Responses to Somethings should be easy, and they are…