By: Audrey Bourret

Audrey Bourret — Tue, 01 Oct 2019 18:52:02 +0000

Thank you so much, this post allowed me to finally get taxonomy on my blast results …

By: Lavinia

Lavinia — Sun, 19 Apr 2015 11:25:57 +0000

Thanks Jeff, I was just about to wade through the BLAST manual to see how to implement this, Google led me here – exactly what I was after, thanks.

By: tallnutt

tallnutt — Wed, 09 Jul 2014 01:50:14 +0000

Hi,

I’ve had the same problem for a couple of years now and still haven’t really found the solution. In metagenomics, methods seem to be moving toward online services like MG-RAST, which presumably have their own taxonomy databases. However, I still like to be able to run my own analyses locally. The quickest way I’ve found to do this is a binary search of the taxonomy db in Python; see the code below. I can’t take credit for the algorithm, which I got from the net somewhere; sorry, it came from too many places to remember them all and give proper credit. This retrieves the tax_id very quickly indeed, but the slow part is then getting the scientific name from Entrez, which I have only managed to do via their server one at a time (maybe this could be sped up by doing several at a time, but I got errors trying that and have given up so far). Anyway, perhaps the code below will be useful. Theo.

from Bio import Entrez
import sys
import os

# taxonfetch.py theo allnutt 2014. Gets the taxonomy ID number from a local copy
# of the NCBI GI/taxonomy list file and retrieves the scientific name field from
# Entrez. ID and name are then appended to the end of the submitted BLAST output
# (tab format, 6) file.

Entrez.email = 'theo.allnutt@csiro.au'
inputfile = sys.argv[1]   # tab format blast output (6), GIs in second column
outputfile = sys.argv[2]  # taxa are added to end columns - name of appended file
taxdb = sys.argv[3]       # specify "p" or "n" for protein or nucleotide taxon list

# usage e.g. taxonfetch.py infileblast.tab outfile.tab p

f = open(inputfile, 'r')
g = open(outputfile, 'w')

# get the GIs
c = 0
gi = []
for line2 in f:
    c = c + 1
    gi.append(line2.split('|')[1])
print(str(c) + " GIs")
c = 0
gitax = []
names = []

# CHANGE THESE PATHS TO THOSE FOR YOUR GI - TAX_ID FILES!
if taxdb == "p":
    print("reading protein taxonomy \n")
    taxaid = "/OSM/HOME-MEL/all29c/data/db/gi_taxid_prot.dmp"
elif taxdb == "n":
    print("reading nucleotide taxonomy")
    taxaid = "/OSM/HOME-MEL/all29c/data/db/gi_taxid_nucl.dmp"
else:
    sys.exit('third argument must be "p" or "n"')

print("searching..")
size = os.path.getsize(taxaid)

print("TaxID File Size: ", size)
t = open(taxaid, 'rb')  # binary mode, so seek() can jump to any byte offset

for gis in gi:  # binary search
    c = c + 1
    found = False
    filegi = None
    offset = 0
    chunk = size
    pos = chunk // 2
    while not found and chunk > 0:
        chunk = chunk // 2
        t.seek(pos)
        if pos > 0:
            t.readline()  # discard the partial line the seek landed in
        entry = t.readline().decode().split("\t")
        if not entry[0].strip():
            # read past the last record: carry on in the lower half
            pos = offset + (chunk // 2)
            continue
        filegi = int(entry[0])
        filetax = int(entry[1])

        if filegi == int(gis):
            answer = filetax
            found = True
            print(c, ":", filegi, answer)
        elif filegi > int(gis):
            pos = offset + (chunk // 2)
        elif filegi < int(gis):
            offset = offset + chunk
            pos = pos + (chunk // 2)

    if not found:
        answer = "00000"
        print(c, ":", filegi, answer, "not found")

    gitax.append(str(answer))

t.close()

# look up the scientific name for each taxid
cc1 = 0
for ids in gitax:
    cc1 = cc1 + 1
    if ids == "00000":
        names.append("No Taxon" + "\t" + ids)
        print(ids, "No Taxon")
    else:
        try:
            entryData = Entrez.efetch(id=ids, db="Taxonomy", retmode="xml")
            # list of dictionaries of stunning complexity - only one entry [0] required
            data1 = Entrez.read(entryData)
            names.append(data1[0]["ScientificName"] + "\t" + data1[0]["TaxId"])  # fetch only one at a time?
            print(cc1, ids, "retrieved", data1[0]["ScientificName"])
        except Exception:
            names.append("not retrieved" + "\t" + ids)
            print(ids, " not retrieved")

# add data to inputfile
n = -1
f.seek(0)
for line4 in f:
    n = n + 1
    print(n)
    print(line4.rstrip('\n') + "\t" + names[n] + "\n")
    g.write(line4.rstrip('\n') + "\t" + names[n] + "\n")

g.close()
f.close()
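
For anyone adapting the script, here is a minimal sketch of two pieces factored into functions: the file lookup written with explicit low/high bounds, and a name lookup that sends several taxids per Entrez request rather than one at a time. The names `find_taxid`, `chunked` and `fetch_names`, and the batch size of 200, are made up for illustration and are not part of taxonfetch.py. It assumes the .dmp file is sorted by GI with one `gi<TAB>taxid` pair per line, and that `Entrez.email` has been set before `fetch_names` is called. The batched request has not been run against the live server here, so treat it as a starting point only.

```python
import os
import tempfile


def find_taxid(handle, size, target_gi):
    """Binary-search a GI-sorted 'gi<TAB>taxid' file opened in binary mode.

    Returns the taxid as an int, or None when the GI is not in the file.
    """
    low, high = 0, size
    while low < high:
        mid = (low + high) // 2
        handle.seek(mid)
        if mid > 0:
            handle.readline()  # discard the partial line the seek landed in
        line = handle.readline().decode("ascii")
        if not line.strip():
            high = mid  # ran past the last record: search to the left
            continue
        fields = line.split("\t")
        file_gi = int(fields[0])
        if file_gi == target_gi:
            return int(fields[1])
        if file_gi < target_gi:
            low = mid + 1
        else:
            high = mid
    return None


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_names(taxids, batch_size=200):
    """Map taxid (str) -> scientific name, one Entrez request per batch.

    batch_size=200 is an arbitrary assumption, not an NCBI recommendation.
    """
    from Bio import Entrez  # lazy import: only this function needs Biopython

    names = {}
    for batch in chunked(sorted(set(taxids)), batch_size):
        handle = Entrez.efetch(db="taxonomy", id=",".join(batch), retmode="xml")
        for record in Entrez.read(handle):
            names[record["TaxId"]] = record["ScientificName"]
        handle.close()
    return names


# Offline self-check of the search on a throwaway file (made-up GI/taxid pairs).
with tempfile.NamedTemporaryFile(suffix=".dmp", delete=False) as tmp:
    tmp.write(b"2\t9913\n3\t9913\n5\t9606\n7\t10090\n")
with open(tmp.name, "rb") as handle:
    demo = [find_taxid(handle, os.path.getsize(tmp.name), gi) for gi in (2, 5, 7, 4)]
os.remove(tmp.name)
print(demo)  # [9913, 9606, 10090, None]
```

A taxid that Entrez has merged into another may come back under a different `TaxId`, so a lookup in the returned dictionary should fall back to "not retrieved" when the key is missing.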

By: Jeff

Jeff — Mon, 31 Mar 2014 02:46:26 +0000

In reply to bfreed05. Yep, that should do it! I chose to do it in my .bash_profile but I believe the end result is the same. Down toward the end of my bash profile I have:

BLASTDB="/where/I/keep/databases"
export BLASTDB

Glad it's working for you.
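
A quick way to confirm the variable is doing its job is to check, from Python, whether the two taxdb files are visible in the directory that BLASTDB names. This is only a sketch: `missing_taxdb_files` is a hypothetical helper, it looks at the environment variable alone (not at any .ncbirc file), and splitting the value on `os.pathsep` is an assumption in case several directories are listed.

```python
import os


def missing_taxdb_files(environ=None):
    """List the taxdb files that are not found in any BLASTDB directory."""
    environ = os.environ if environ is None else environ
    wanted = ["taxdb.btd", "taxdb.bti"]
    # Assumption: BLASTDB may hold several directories joined by os.pathsep.
    dirs = [
        os.path.expanduser(d)
        for d in environ.get("BLASTDB", "").split(os.pathsep)
        if d
    ]
    return [
        name for name in wanted
        if not any(os.path.isfile(os.path.join(d, name)) for d in dirs)
    ]


missing = missing_taxdb_files()
print("missing:", missing if missing else "nothing, both taxdb files were found")
```

If both names are reported missing, either BLASTDB is unset in the shell that launched Python or the taxdb files live somewhere else.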

By: bfreed05

bfreed05 — Sun, 30 Mar 2014 18:05:01 +0000

In reply to bfreed05. Finally figured it out. My home directory (~/) needed a file named ".ncbirc", with no space or letters before the period. I had to check "display hidden files" in order to see it once it was created. In this .ncbirc file, I wrote the following text: "[BLAST] BLASTDB=~/ncbi-blast-2.2.29+/db/" and saved. After that, everything worked normally.
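
For reference, the same file can be produced from a script. The sketch below assumes the usual INI layout, with the `[BLAST]` section header on its own line and the `BLASTDB=` setting on the next. `write_ncbirc` is a hypothetical helper, and it writes an absolute path so that nothing depends on whether "~" gets expanded. The demo writes into a temporary directory, not the real home directory.

```python
import tempfile
from pathlib import Path


def write_ncbirc(db_dir, rc_path):
    """Write a two-line .ncbirc that points BLAST+ at db_dir."""
    # Assumption: an absolute path is used so "~" never needs expanding later.
    db_dir = Path(db_dir).expanduser().resolve()
    text = "[BLAST]\nBLASTDB={}\n".format(db_dir)
    Path(rc_path).write_text(text)
    return text


# Demo in a throwaway directory; a real run would target Path.home() / ".ncbirc".
with tempfile.TemporaryDirectory() as scratch:
    rc_file = Path(scratch) / ".ncbirc"
    written = write_ncbirc(Path(scratch) / "db", rc_file)
    read_back = rc_file.read_text()
print(written)
```

Calling `write_ncbirc("~/ncbi-blast-2.2.29+/db", Path.home() / ".ncbirc")` would recreate the setup described above, overwriting any existing .ncbirc.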

By: bfreed05

bfreed05 — Sun, 30 Mar 2014 01:04:37 +0000

Hello Jeff,
I’m using the taxdb system myself to characterize some environmental contigs I’ve collected. I’m using a Linux machine for the first time for this, and have run into the same problem where the taxdb file is not found on the path.
What is the BLASTDB variable? My setup is as follows: ~/ncbi-blast-2.2.29+/db is the folder containing my taxdb.btd and taxdb.bti files. The actual blast program runs in ~/ncbi-blast-2.2.29+/bin. Both locations are added to my PATH, but it doesn’t recognize the taxon files and leaves an NA when I request “scientific names” in my output format. The taxid displays correctly. Any help is greatly appreciated.

Comments on: Somethings should be easy, and they are…
