Comments on: Some things should be easy…

By: Jeff

Jeff — Wed, 04 Feb 2015 22:02:31 +0000

In reply to mialian.

Mialian,
That’s not C; sfederman’s code is still Python at that point. If you replace the c.execute line in the Option #3 code with the one sfederman suggested, it should work. Alternatively, try this method instead: https://www.polarmicrobes.org/?p=859.

By: mialian

mialian — Wed, 21 Jan 2015 10:21:19 +0000

In reply to sfederman.

sfederman,
I would like to implement your suggestion to (hopefully) speed up the process further, but I don’t understand exactly how to implement it. I am not used to C code. Could you show where and how you would implement it in Jeff’s code?

Thanks in advance!

By: Jeff

Jeff — Thu, 04 Jul 2013 16:57:05 +0000

If you are interested in setting up a taxonomy database please read this post describing a better method: https://www.polarmicrobes.org/?p=859

By: Jeff

Jeff — Tue, 18 Jun 2013 18:31:39 +0000

In reply to Jeff.

Further modifications… if you use the above fgrep --max-count=1 method you will invariably match some lines in the .dmp file incorrectly (wherever your query gi appears as a substring of another gi). Since --max-count=1 aborts the search after one match, you will often not get the line you want. Use grep (or zgrep) -P "23435\t" instead. The -P option uses Perl regex, allowing the use of the \t (tab) character, so the query must be followed directly by the tab that separates the gi from the taxid. This example would match:

23435 46
123435 893

but not:

234357 21456

This effectively reduces the .dmp file to a size that allows dictionary creation, but it will contain a number of erroneous matches (not a problem, because the dictionary will force exact matches downstream).
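
The two stages described here can be sketched in Python. This is only an illustration, not the script used in the post: the function names prefilter and build_gi_dict are hypothetical, the sample lines are the three example lines above, and the .dmp format is assumed to be one tab-separated gi and taxid per line.

```python
import re


def prefilter(lines, gi):
    """Keep lines where the gi is followed directly by a tab.

    Mirrors grep -P "gi\\t": it still over-matches any longer gi
    that merely ends in the query digits.
    """
    pattern = re.compile(re.escape(str(gi)) + r"\t")
    return [line for line in lines if pattern.search(line)]


def build_gi_dict(lines):
    """Build a gi -> taxid dictionary from tab-separated lines."""
    gi_to_taxid = {}
    for line in lines:
        gi, taxid = line.rstrip("\n").split("\t")
        gi_to_taxid[int(gi)] = int(taxid)
    return gi_to_taxid


# Sample lines taken from the example above (assumed tab-separated).
sample = ["23435\t46\n", "123435\t893\n", "234357\t21456\n"]

kept = prefilter(sample, 23435)   # first two lines survive the filter
lookup = build_gi_dict(kept)      # the erroneous 123435 entry is harmless

print(kept)
print(lookup.get(23435))          # exact-key lookup returns only the true match
```

The dictionary lookup is keyed on the whole gi, so the extra 123435 entry that slipped through the filter is never returned for a query of 23435.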

By: sfederman

sfederman — Thu, 06 Jun 2013 06:34:27 +0000

I've also found this post very helpful in setting up a taxonomy database. Thanks for putting me on the right path. I want to let you know one adjustment I made in order to speed up the queries. When creating the database, I adjusted your script above slightly:

c.execute('''CREATE TABLE gi_taxid (
			gi INTEGER PRIMARY KEY,
			taxid INTEGER)''')

This makes gi the primary key, so SQLite indexes that column in the database file, which greatly speeds up searches.
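
For readers who want to try this in isolation, here is a minimal, self-contained sketch using Python's sqlite3 module. It assumes an in-memory database and three made-up rows borrowed from the example lines elsewhere in this thread; the real script would connect to a database file and load the full .dmp data instead.

```python
import sqlite3

# Assumption: an in-memory database stands in for the real database file.
conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Same table definition as above: gi is the primary key.
c.execute('''CREATE TABLE gi_taxid (
                gi INTEGER PRIMARY KEY,
                taxid INTEGER)''')

# Illustrative rows only, not real taxonomy data.
rows = [(23435, 46), (123435, 893), (234357, 21456)]
c.executemany("INSERT INTO gi_taxid VALUES (?, ?)", rows)
conn.commit()

# Look up a single gi; the parameter must be passed as a tuple.
c.execute("SELECT taxid FROM gi_taxid WHERE gi = ?", (23435,))
row = c.fetchone()
print(row)

# Ask SQLite how it would run the query. The last column of each
# row is a human-readable description of the lookup strategy.
plan = c.execute(
    "EXPLAIN QUERY PLAN SELECT taxid FROM gi_taxid WHERE gi = ?",
    (23435,),
).fetchall()
for step in plan:
    print(step[-1])

conn.close()
```

The query plan text varies between SQLite versions, so it is printed here rather than relied on; it should indicate a search on the primary key rather than a full table scan.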

By: Jeff

Jeff — Sun, 26 May 2013 17:29:36 +0000

Glad it helped! I’ve made the following two modifications to my use of grep, now using “zgrep -F”. zgrep allows a search on a gzipped dmp file (dmp.gz), which saves a little disk space. The -F parameter (the same as using fgrep on a non-gzipped file) treats the search pattern as a fixed string rather than a regular expression (regex isn’t needed here), which speeds up the search.
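
For anyone who would rather stay inside Python, a rough equivalent of a fixed-string search over a gzipped file might look like the sketch below. It is not a replacement for zgrep -F and will likely be slower on the full dump; the function name fixed_string_search and the tiny temporary file are hypothetical, created only so the example runs on its own.

```python
import gzip
import os
import tempfile


def fixed_string_search(path, needle):
    """Return lines of a gzipped text file that contain needle.

    Uses a plain substring test, so no regular expressions are
    interpreted (similar in spirit to zgrep -F).
    """
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle if needle in line]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the real dmp.gz file.
    path = os.path.join(tmp, "example.dmp.gz")
    with gzip.open(path, "wt") as handle:
        handle.write("23435\t46\n123435\t893\n99\t7\n")

    hits = fixed_string_search(path, "23435")

print(hits)
```

A plain substring test also returns the 123435 line, because the query appears inside that longer gi.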

By: cjcook

cjcook — Wed, 22 May 2013 17:03:35 +0000

Thank you so much for posting this. I’ve been struggling with the same problem for a couple of weeks now and it never occurred to me to grep for it!