<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	>
<channel>
	<title>
	Comments on: Somethings should be easy, and they are&#8230;	</title>
	<atom:link href="https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/</link>
	<description>Marine Microbial Ecology</description>
	<lastBuildDate>Tue, 01 Oct 2019 18:52:02 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>
		By: Audrey Bourret		</title>
		<link>https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-427</link>

		<dc:creator><![CDATA[Audrey Bourret]]></dc:creator>
		<pubDate>Tue, 01 Oct 2019 18:52:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.polarmicrobes.org/?p=859#comment-427</guid>

					<description><![CDATA[Thank you so much, this post allowed me to finally get taxonomy on my blast results ...]]></description>
			<content:encoded><![CDATA[<p>Thank you so much, this post allowed me to finally get taxonomy on my blast results &#8230;</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Lavinia		</title>
		<link>https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-267</link>

		<dc:creator><![CDATA[Lavinia]]></dc:creator>
		<pubDate>Sun, 19 Apr 2015 11:25:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.polarmicrobes.org/?p=859#comment-267</guid>

					<description><![CDATA[Thanks Jeff, I was just about to wade through the BLAST manual to see how to implement this, Google led me here - exactly what I was after, thanks.]]></description>
			<content:encoded><![CDATA[<p>Thanks Jeff, I was just about to wade through the BLAST manual to see how to implement this, Google led me here &#8211; exactly what I was after, thanks.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: tallnutt		</title>
		<link>https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-198</link>

		<dc:creator><![CDATA[tallnutt]]></dc:creator>
		<pubDate>Wed, 09 Jul 2014 01:50:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.polarmicrobes.org/?p=859#comment-198</guid>

					<description><![CDATA[Hi,

I&#039;ve had the same problem for a couple of years now, and still haven&#039;t really found the solution. In metagenomics, methods seem to be moving toward using online services like MG-RAST, which presumably have their own taxonomy databases. However, I still like to be able to run my own analyses locally. The quickest way I&#039;ve found to do this is to use a binary search of the taxonomy db in Python, see code below. I can&#039;t take credit for the algorithm, which I got from the net somewhere; sorry, it&#039;s from too many places to remember them all and give proper credit. This retrieves the tax_id very quickly indeed, but the slow part is then getting the scientific name from Entrez, which I have only managed to do via their server one at a time (maybe this could be sped up by doing several at a time, but I got errors trying that and have given up so far). Anyway, perhaps the code below will be useful. Theo.

from Bio import Entrez
import sys
import os

#taxonfetch.py theo allnutt 2014. Gets the taxonomy ID number from a local copy of the NCBI GI/taxonomy list file #and retrieves the scientific name field from entrez. ID and name are then appended to the end of the submitted #blast output (tab format, 6) file.

Entrez.email = &#039;theo.allnutt@csiro.au&#039;
inputfile = sys.argv[1] #tab format blast output (6) GIs in second column
outputfile = sys.argv[2] #taxa are added to end columns - name of appended file
taxdb = sys.argv[3] #specify &quot;p&quot; or &quot;n&quot; for protein or nucleotide taxon list

#usage e.g. taxonfetch.py infileblast.tab outfile.tab p

f = open(inputfile,&#039;r&#039;)
g = open(outputfile,&#039;w&#039;)

c=0
dataout = &quot;&quot;
t=0
 #get the GIs
gi=[]
dataout=[]
ids=[]

for line2 in f:
	c=c+1
	gi.append(line2.split(&#039;&#124;&#039;)[1])
print str(c)+&quot;   GIs&quot;
#print gi
c=0
gitax=[]

names =[]
t=0

#CHANGE THESE PATHS TO THOSE FOR YOUR GI - TAX_ID FILES!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
if taxdb == &quot;p&quot;:
	print &quot;reading protein taxonomy \n&quot;
	taxaid = &quot;/OSM/HOME-MEL/all29c/data/db/gi_taxid_prot.dmp&quot;
if taxdb == &quot;n&quot;:
	print &quot;reading nucleotide taxonomy&quot;
	taxaid = &quot;/OSM/HOME-MEL/all29c/data/db/gi_taxid_nucl.dmp&quot;

print &quot;searching..&quot;
size = long(os.path.getsize(taxaid))

print &quot;TaxID File Size: &quot;, size
t = open(taxaid,&#039;r&#039;)

for gis in gi: #binary search #########################################
	c=c+1
	#print gis
	found=False
	offset=0
	chunk=size
	pos=chunk/2
	while found == False and chunk&#062;0:
		#print &quot;posn: &quot;, pos
		chunk = chunk/2
		#print &quot;chunk:&quot;, chunk
		#print &quot;offset: &quot;, offset
		t.seek(pos)
		t.readline()
		entry = t.readline().split(&quot;\t&quot;)
		if entry[0]:
			filegi = int(entry[0])
			filetax = int(entry[1].rstrip(&quot;\n&quot;))
		
		#print gis, filegi, filetax
		#raw_input()
		
		if filegi == int(gis):
			answer = filetax
			#print filegi, &quot;  FOUND&quot;
			found = True
			#print &#039;chunk:&#039;,chunk,&#039;   offset:&#039;,offset,&#039;   posn:&#039;,pos
			print c,&quot;:&quot;,filegi, answer
		elif filegi &#062; int(gis):
			pos = offset +(chunk/2)
		
		elif filegi &#060; int(gis):
			offset = offset+chunk
			pos = pos + (chunk/2)
	
	if found == False:
		answer = &#034;00000&#034;
		print c,&#034;:&#034;,filegi, answer, &#034;not found&#034;
	
	
	gitax.append(str(answer))
	
#print gitax	#list of Gi&#039;s taxids
cc1=0
for ids in gitax:
	cc1=cc1+1
	if ids ==&#034;00000&#034;:
		names.append(&#034;No Taxon&#034;+&#034;\t&#034;+ids)
		print ids,&#034;No Taxon&#034;
	else:
		try:
			entryData = Entrez.efetch(id = ids, db = &#034;Taxonomy&#034;, retmode = &#034;xml&#034;)
			data1=Entrez.read(entryData) #list of dictionaries of stunning complexity - only one entry [0] required
			names.append(data1[0][&#034;ScientificName&#034;]+&#034;\t&#034;+data1[0][&#034;TaxId&#034;]) #fetch only one at a time?
			print cc1,ids, &#034;retrieved&#034;,data1[0][&#034;ScientificName&#034;]
		except:
			names.append(&#034;not retrieved&#034;+&#034;\t&#034;+ids)		
			print ids,&#034; not retrieved&#034;
#add data to inputfile
n=-1
f.seek(0)
for line4 in f:
	n=n+1
	print n
	print line4.rstrip(&#039;\n&#039;)+&#034;\t&#034;+names[n]+&#034;\n&#034;
	g.write(line4.rstrip(&#039;\n&#039;)+&#034;\t&#034;+names[n]+&#034;\n&#034;)


g.close()
f.close()]]></description>
			<content:encoded><![CDATA[<p>Hi,</p>
<p>I&#8217;ve had the same problem for a couple of years now, and still haven&#8217;t really found the solution. In metagenomics, methods seem to be moving toward using online services like MG-RAST, which presumably have their own taxonomy databases. However, I still like to be able to run my own analyses locally. The quickest way I&#8217;ve found to do this is to use a binary search of the taxonomy db in Python, see code below. I can&#8217;t take credit for the algorithm, which I got from the net somewhere; sorry, it&#8217;s from too many places to remember them all and give proper credit. This retrieves the tax_id very quickly indeed, but the slow part is then getting the scientific name from Entrez, which I have only managed to do via their server one at a time (maybe this could be sped up by doing several at a time, but I got errors trying that and have given up so far). Anyway, perhaps the code below will be useful. Theo.</p>
<p>from Bio import Entrez<br />
import sys<br />
import os</p>
<p>#taxonfetch.py theo allnutt 2014. Gets the taxonomy ID number from a local copy of the NCBI GI/taxonomy list file #and retrieves the scientific name field from entrez. ID and name are then appended to the end of the submitted #blast output (tab format, 6) file.</p>
<p>Entrez.email = &#039;theo.allnutt@csiro.au&#039;<br />
inputfile = sys.argv[1] #tab format blast output (6) GIs in second column<br />
outputfile = sys.argv[2] #taxa are added to end columns - name of appended file<br />
taxdb = sys.argv[3] #specify &quot;p&quot; or &quot;n&quot; for protein or nucleotide taxon list</p>
<p>#usage e.g. taxonfetch.py infileblast.tab outfile.tab p</p>
<p>f = open(inputfile,&#039;r&#039;)<br />
g = open(outputfile,&#039;w&#039;)</p>
<p>c=0<br />
dataout = &quot;&quot;<br />
t=0<br />
 #get the GIs<br />
gi=[]<br />
dataout=[]<br />
ids=[]</p>
<p>for line2 in f:<br />
	c=c+1<br />
	gi.append(line2.split(&#039;|&#039;)[1])<br />
print str(c)+&quot;   GIs&quot;<br />
#print gi<br />
c=0<br />
gitax=[]</p>
<p>names =[]<br />
t=0</p>
<p>#CHANGE THESE PATHS TO THOSE FOR YOUR GI &#8211; TAX_ID FILES!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />
if taxdb == &quot;p&quot;:<br />
	print &quot;reading protein taxonomy \n&quot;<br />
	taxaid = &quot;/OSM/HOME-MEL/all29c/data/db/gi_taxid_prot.dmp&quot;<br />
if taxdb == &quot;n&quot;:<br />
	print &quot;reading nucleotide taxonomy&quot;<br />
	taxaid = &quot;/OSM/HOME-MEL/all29c/data/db/gi_taxid_nucl.dmp&quot;</p>
<p>print &quot;searching..&quot;<br />
size = long(os.path.getsize(taxaid))</p>
<p>print &quot;TaxID File Size: &quot;, size<br />
t = open(taxaid,&#039;r&#039;)</p>
<p>for gis in gi: #binary search #########################################<br />
	c=c+1<br />
	#print gis<br />
	found=False<br />
	offset=0<br />
	chunk=size<br />
	pos=chunk/2<br />
	while found == False and chunk&gt;0:<br />
		#print &quot;posn: &quot;, pos<br />
		chunk = chunk/2<br />
		#print &quot;chunk:&quot;, chunk<br />
		#print &quot;offset: &quot;, offset<br />
		t.seek(pos)<br />
		t.readline()<br />
		entry = t.readline().split(&quot;\t&quot;)<br />
		if entry[0]:<br />
			filegi = int(entry[0])<br />
			filetax = int(entry[1].rstrip(&quot;\n&quot;))</p>
<p>		#print gis, filegi, filetax<br />
		#raw_input()</p>
<p>		if filegi == int(gis):<br />
			answer = filetax<br />
			#print filegi, &quot;  FOUND&quot;<br />
			found = True<br />
			#print &#039;chunk:&#039;,chunk,&#039;   offset:&#039;,offset,&#039;   posn:&#039;,pos<br />
			print c,&quot;:&quot;,filegi, answer<br />
		elif filegi &gt; int(gis):<br />
			pos = offset +(chunk/2)</p>
<p>		elif filegi &lt; int(gis):<br />
			offset = offset+chunk<br />
			pos = pos + (chunk/2)</p>
<p>	if found == False:<br />
		answer = &quot;00000&quot;<br />
		print c,&quot;:&quot;,filegi, answer, &quot;not found&quot;</p>
<p>	gitax.append(str(answer))</p>
<p>#print gitax	#list of Gi&#039;s taxids<br />
cc1=0<br />
for ids in gitax:<br />
	cc1=cc1+1<br />
	if ids ==&quot;00000&quot;:<br />
		names.append(&quot;No Taxon&quot;+&quot;\t&quot;+ids)<br />
		print ids,&quot;No Taxon&quot;<br />
	else:<br />
		try:<br />
			entryData = Entrez.efetch(id = ids, db = &quot;Taxonomy&quot;, retmode = &quot;xml&quot;)<br />
			data1=Entrez.read(entryData) #list of dictionaries of stunning complexity &#8211; only one entry [0] required<br />
			names.append(data1[0][&quot;ScientificName&quot;]+&quot;\t&quot;+data1[0][&quot;TaxId&quot;]) #fetch only one at a time?<br />
			print cc1,ids, &quot;retrieved&quot;,data1[0][&quot;ScientificName&quot;]<br />
		except:<br />
			names.append(&quot;not retrieved&quot;+&quot;\t&quot;+ids)<br />
			print ids,&quot; not retrieved&quot;<br />
#add data to inputfile<br />
n=-1<br />
f.seek(0)<br />
for line4 in f:<br />
	n=n+1<br />
	print n<br />
	print line4.rstrip(&#039;\n&#039;)+&quot;\t&quot;+names[n]+&quot;\n&quot;<br />
	g.write(line4.rstrip(&#039;\n&#039;)+&quot;\t&quot;+names[n]+&quot;\n&quot;)</p>
<p>g.close()<br />
f.close()</p>
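<p>For anyone on Python 3, the binary-search step can be sketched roughly as below. This is only a minimal sketch under a few assumptions: the dump file is sorted by GI in ascending order, has one tab-separated GI and tax_id per line with no blank lines in the middle, and the names find_taxid and _line_at_or_after are made up for illustration. It is a cleaned-up take on the same idea, not a drop-in replacement for the script above.</p>

```python
import os


def _line_at_or_after(handle, pos):
    """Return the first whole line that starts at byte offset pos or later."""
    if pos == 0:
        handle.seek(0)
    else:
        # Step back one byte so a line starting exactly at pos is not skipped.
        handle.seek(pos - 1)
        handle.readline()
    return handle.readline()


def find_taxid(dmp_path, gi):
    """Look up gi in a sorted, tab-separated GI/tax_id file; return None if absent."""
    target = int(gi)
    with open(dmp_path, "rb") as handle:
        lo, hi = 0, os.path.getsize(dmp_path)
        # Find the smallest offset whose following line has a GI that is
        # not smaller than the target (or that runs off the end of the file).
        while lo < hi:
            mid = (lo + hi) // 2
            fields = _line_at_or_after(handle, mid).split()
            if fields and int(fields[0]) < target:
                lo = mid + 1
            else:
                hi = mid
        fields = _line_at_or_after(handle, lo).split()
    if fields and int(fields[0]) == target:
        return int(fields[1])
    return None
```

<p>Calling find_taxid on the path to your gi_taxid_nucl.dmp with a GI taken from the BLAST output returns the tax_id as an int, or None when the GI is not in the file. Each lookup needs only a few dozen seeks even on a multi-gigabyte file, which is why this step is fast compared with the Entrez requests.</p>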
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Jeff		</title>
		<link>https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-188</link>

		<dc:creator><![CDATA[Jeff]]></dc:creator>
		<pubDate>Mon, 31 Mar 2014 02:46:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.polarmicrobes.org/?p=859#comment-188</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-187&quot;&gt;bfreed05&lt;/a&gt;.

Yep, that should do it!  I chose to do it in my .bash_profile but I believe the end result is the same.  Down toward the end of my bash profile I have:

&lt;code&gt;BLASTDB=&quot;/where/I/keep/databases&quot;
export BLASTDB&lt;/code&gt;

Glad it&#039;s working for you.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-187">bfreed05</a>.</p>
<p>Yep, that should do it!  I chose to do it in my .bash_profile but I believe the end result is the same.  Down toward the end of my bash profile I have:</p>
<p><code>BLASTDB="/where/I/keep/databases"<br />
export BLASTDB</code></p>
<p>Glad it&#8217;s working for you.</p>
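<p>If you launch BLAST from a Python script rather than from a login shell, the same variable can be set for just that one process. A minimal sketch, assuming blastn is on your PATH and the taxdb files sit in the directory you pass in; the function names and the file names in the usage note are hypothetical:</p>

```python
import os
import subprocess


def blast_env(db_dir):
    """Copy the current environment and point BLASTDB at db_dir."""
    env = dict(os.environ)
    env["BLASTDB"] = os.path.expanduser(db_dir)
    return env


def blast_command(query, db, out):
    """Build a blastn call that adds taxids and scientific names to tabular output."""
    return [
        "blastn",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "6 std staxids sscinames",
    ]


def run_blast(query, db, out, db_dir):
    """Run blastn with BLASTDB set only for this child process."""
    subprocess.run(blast_command(query, db, out), env=blast_env(db_dir), check=True)
```

<p>For example, run_blast("contigs.fasta", "nt", "hits.tab", "/where/I/keep/databases") should behave like exporting BLASTDB in the shell and then running blastn by hand.</p>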
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: bfreed05		</title>
		<link>https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-187</link>

		<dc:creator><![CDATA[bfreed05]]></dc:creator>
		<pubDate>Sun, 30 Mar 2014 18:05:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.polarmicrobes.org/?p=859#comment-187</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-186&quot;&gt;bfreed05&lt;/a&gt;.

Finally figured it out. My home directory (~/) needed a file titled &quot;.ncbirc&quot;, with no space or letters before the period. I had to check &quot;display hidden files&quot; in order to see it once it was created. In this .ncbirc file, I wrote the following text: &quot;[BLAST]  BLASTDB=~/ncbi-blast-2.2.29+/db/&quot; and saved. After that, everything worked normally.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-186">bfreed05</a>.</p>
<p>Finally figured it out. My home directory (~/) needed a file titled &#8220;.ncbirc&#8221;, with no space or letters before the period. I had to check &#8220;display hidden files&#8221; in order to see it once it was created. In this .ncbirc file, I wrote the following text: &#8220;[BLAST]  BLASTDB=~/ncbi-blast-2.2.29+/db/&#8221; and saved. After that, everything worked normally.</p>
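<p>The same file can also be created from a script. This is a rough sketch with a few assumptions of mine: the [BLAST] header goes on its own line, the database path is written out in full rather than with ~, and write_ncbirc is a made-up helper name. It overwrites any existing file at that location.</p>

```python
import os


def write_ncbirc(db_dir, rc_path="~/.ncbirc"):
    """Write a minimal .ncbirc that points BLASTDB at db_dir; return the file path."""
    # Expand ~ here so the file contains an absolute path.
    db_dir = os.path.abspath(os.path.expanduser(db_dir))
    rc_path = os.path.expanduser(rc_path)
    # Opening in "w" mode replaces whatever was in the file before.
    with open(rc_path, "w") as handle:
        handle.write("[BLAST]\nBLASTDB={}\n".format(db_dir))
    return rc_path
```

<p>For the setup described here, that would be write_ncbirc("~/ncbi-blast-2.2.29+/db").</p>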
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: bfreed05		</title>
		<link>https://www.polarmicrobes.org/somethings-should-be-easy-and-they-are/#comment-186</link>

		<dc:creator><![CDATA[bfreed05]]></dc:creator>
		<pubDate>Sun, 30 Mar 2014 01:04:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.polarmicrobes.org/?p=859#comment-186</guid>

					<description><![CDATA[Hello Jeff,
I&#039;m using the taxdb system myself to characterize some environmental contigs I&#039;ve collected. I&#039;m using a Linux machine for the first time for this, and have run into the same problem where the taxdb files are not found.
What is the BLASTDB variable?  My setup is as follows: ~/ncbi-blast-2.2.29+/db is the folder containing my taxdb.btd and taxdb.bti files. The actual BLAST program runs in ~/ncbi-blast-2.2.29+/bin. Both locations are added to my PATH, but it doesn&#039;t recognize the taxon files and leaves an N/A when I request &quot;scientific names&quot; in my output format. The taxid displays correctly. Any help is greatly appreciated.]]></description>
			<content:encoded><![CDATA[<p>Hello Jeff,<br />
I&#8217;m using the taxdb system myself to characterize some environmental contigs I&#8217;ve collected. I&#8217;m using a Linux machine for the first time for this, and have run into the same problem where the taxdb files are not found.<br />
What is the BLASTDB variable?  My setup is as follows: ~/ncbi-blast-2.2.29+/db is the folder containing my taxdb.btd and taxdb.bti files. The actual BLAST program runs in ~/ncbi-blast-2.2.29+/bin. Both locations are added to my PATH, but it doesn&#8217;t recognize the taxon files and leaves an N/A when I request &#8220;scientific names&#8221; in my output format. The taxid displays correctly. Any help is greatly appreciated.</p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
