Maintaining an updated record of GenBank genomes

For an ongoing project I need a local copy of all the prokaryotic genomes in GenBank.  New genomes are being added at an ever-increasing rate, making it difficult to keep up by manually downloading them from the FTP site.  I finally decided to write a script that will automatically keep my local directory up to date.  I'm only interested in the ffn files, but it would be easy to adapt the following script to any other file type.  As always, suggestions for improving the script are welcome.  A word of warning… I've tested all the components of the script individually, but due to long download times I haven't tested the whole thing as a single entity.

First, the necessary modules:

import subprocess
import os
from ftplib import FTP
import sys
from cStringIO import StringIO

I want the ffn files from both the complete and draft GenBank FTP directories.  Working first with the complete genomes:

# mirror the Bacteria directory, keeping only .ffn files; -nc skips anything already downloaded
bacteria = subprocess.Popen('wget -r -A "*.ffn" -nc ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/', shell = True)
bacteria.communicate()

That should have created a directory tree mimicking the one on the FTP server, but containing only files ending in .ffn.  Each genome directory likely contains multiple files, one for each element of the genome.  I'd like to concatenate them so that I can analyze each genome as a unit.

# make sure the output directory exists, then skip genomes that have already been combined
if not os.path.isdir('combined_ffn'):
    os.makedirs('combined_ffn')

have = os.listdir('combined_ffn')
for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria'):
    if d+'.combined.ffn' not in have:
        subprocess.call('cat ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/'+d+'/*.ffn > combined_ffn/'+d+'.combined.ffn', shell = True)
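For anyone who would rather stay in Python for this step, a roughly equivalent pure-Python version is sketched below.  The combine function and its glob pattern are my own additions, assuming the directory layout that wget creates above, and haven't been run against the full download.

import glob
import os
import shutil

def combine(src_dir, out_path):
    # concatenate every .ffn file in src_dir into a single output file
    with open(out_path, 'w') as out:
        for ffn in sorted(glob.glob(os.path.join(src_dir, '*.ffn'))):
            with open(ffn) as handle:
                shutil.copyfileobj(handle, out)

Calling combine('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/'+d, 'combined_ffn/'+d+'.combined.ffn') inside the loop above would then stand in for the cat call.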

Now moving on to the draft genomes.  This is a little trickier, since genome directories are removed from Bacteria_DRAFT once assembly and annotation are complete, and my local directory needs to reflect those changes.  To figure out what should be in the draft directory, and to remove any local directories that should no longer be there:

# ftp.dir() prints the listing to stdout, so temporarily capture stdout
old_stdout = sys.stdout
sys.stdout = mystdout = StringIO()

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
ftp.cwd('genbank/genomes/Bacteria_DRAFT/')
ftp.dir()
ftp.quit()

sys.stdout = old_stdout
# keep a copy of the listing on disk for reference
with open('temp.txt', 'w') as good_out:
    print >> good_out, mystdout.getvalue()

# remove any local draft directory that no longer appears in the remote listing
for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT'):
    if d not in mystdout.getvalue():
        subprocess.call('rm -r ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d, shell = True)
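As an aside, ftplib can also return the listing directly via nlst(), which avoids the stdout redirection trick entirely.  nlst() is a standard ftplib method, but I haven't tested this variant against the NCBI server, so treat it as a sketch that could replace the block above:

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
# nlst() returns the names as a list instead of printing them
listing = ftp.nlst('genbank/genomes/Bacteria_DRAFT')
ftp.quit()

# entries may come back as full paths, so keep only the final component
remote_drafts = set(entry.split('/')[-1] for entry in listing)

for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT'):
    if d not in remote_drafts:
        subprocess.call('rm -r ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d, shell = True)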

Now update the local directory tree.

# the draft ffn files are distributed as *scaffold.ffn.tgz tarballs
d_bacteria = subprocess.Popen('wget -r -A "*scaffold.ffn.tgz" -nc ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/', shell = True)
d_bacteria.communicate()

Another hang-up with the draft genomes is that the ffn files all arrive as gzipped tarballs, which need to be unpacked before concatenation.

# unpack each tarball inside its own genome directory; the subshell means there's no need to cd back out
untar = subprocess.Popen('for d in ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/*; do (cd "$d"; ls *.tgz | parallel "tar -xzvf {.}.tgz"; rm *.tgz); done', shell = True)
untar.communicate()
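That one-liner relies on GNU parallel being installed.  If it isn't, a plain-Python version using the standard tarfile module does the same job (just serially); this is my own rough sketch, not something I've run against the full draft tree:

import glob
import os
import tarfile

draft_root = 'ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT'
for d in os.listdir(draft_root):
    for tgz in glob.glob(os.path.join(draft_root, d, '*.tgz')):
        # extract the gzipped tarball into its genome directory, then delete it
        archive = tarfile.open(tgz, 'r:gz')
        archive.extractall(os.path.join(draft_root, d))
        archive.close()
        os.remove(tgz)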

And now we’re on the home stretch…

for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT'):
    contents = os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d)
    # only combine directories that actually contain ffn files and haven't been combined already
    has_ffn = any(f.endswith('.ffn') for f in contents)
    if has_ffn and d+'.draft.combined.ffn' not in have:
        subprocess.call('cat ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d+'/*.ffn > combined_ffn/'+d+'.draft.combined.ffn', shell = True)

I tried to write this script in both pure bash and pure Python, and failed in both cases.  A more experienced scripter would probably have little trouble with either, but my version does illustrate how well bash and Python play together.  For the first run I executed the script manually.  It'll probably take a couple of days to download over our building's 1 Mbps connection (come on UW, it's 2013…).  Once things are running smoothly I'll place a call to the script in /private/etc/periodic/weekly and have an automatically updated database!

 
