**UPDATE**
Got it. You have to set the BLASTDB variable for this to work and place the taxonomy files in this same location. I just did it in my .bash_profile:
BLASTDB=”/path/to/databases”
export BLASTDB
**UPDATE**
The original motivation for this was to obtain taxonomy data for blast search results. Sometime in the last few months this feature actually slipped into the BLAST+ package. I can’t actually get the feature to work, but perhaps you can! It requires the installation of two special taxonomy files found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz. Typically cryptic instructions can be found on the command line blast manual page at http://www.ncbi.nlm.nih.gov/books/NBK1763/, along with the necessary flags to get the taxonomy to appear (ha!) in table format output. I suspect that my inability to get this to work stems from my dislike for mucking with the blast profile file. Instead of setting a default database location I use the complete path in my calls. Regardless of where I locate the taxonomy database blast can’t seem to find it 🙁
****
A couple of months ago I wrote a post on the difficulty of parsing the gi_taxid_nucl.dmp.gz file that maps sequence gi numbers to their NCBI taxonomy number (taxid). With a taxid it is possible to obtain the actual taxonomy of (some) sequences in Genbank. I proposed a couple of elaborate solutions, some of which were improved upon in the comments section by other contributors. One particularly good suggestion was to use primary keys in a SQL database for fast look-ups.
In my latest analysis I have ~ 50,000 gi numbers to map to taxids and had to revisit this issue again. After some tinkering I realized I’d missed a very obvious and reasonably efficient way to extract the relevant lines from the gi_taxid_nucl/prot.dmp.gz file. If you have only a few gi numbers to map stick with the SQL or grep solutions. If you have many give this method a try. Making use of Python’s with statement it is possible to very quickly search each line of the .dmp.gz against a known collection of gi numbers contained in a set. I only recently discovered Python sets, and they have already saved me numerous times. Searching against items in a set is orders of magnitude faster than searching against items in a list.
# assuming that gis is a set of gi numbers... import gzip gi_taxid_output = open('small_gi_taxid.txt', 'w') with gzip.open('gi_taxid_prot.dmp.gz', 'r') as dmp: found = 0 for line in dmp: line_split = line.split('\t') print line_split[0], found if str(line_split[0]) in gis: found = found + 1 print >> gi_taxid_output, line
Using this method a search for 50,000 gi numbers takes ~ 1 hour for gi_taxid_nucl.dmp.gz and about 30 minutes for gi_taxid_prot.dmp.gz. The grep method, even utilizing the gnu parallel command for an 8 fold speed increase (on an 8 core machine), would have taken at least overnight and produced a messy gi_taxid file with many erroneous hits.
6 Responses to Somethings should be easy, and they are…