Predicting metabolic pathways in a metagenome

For an ongoing project I needed to predict the metabolic pathways that are present in a metagenome.  This is actually something that I’ve been interested in trying to do for a while, as metabolic pathways can tell us much more about metabolic function than genes alone.  There may be some pre-packaged methods out there for doing this, but I’ve struggled over the years to find one that doesn’t require commercial software or a subscription to a commercial database (i.e. KEGG).

I’ve worked out the following method that relies on DIAMOND, Biopython, and Pathway-Tools.  All are available for free for academic users.  I went down this path at the request of a reviewer for a submitted manuscript, if you use the method please cite the (hopefully) forthcoming work – Microbial communities can be described by metabolic structure: A general framework and application to a seasonally variable, depth-stratified microbial community from the coastal West Antarctic Peninsula.  I haven’t archived the work on a pre-press server (I’m still not sure how I feel about those), but I’m happy to provide a copy of the manuscript and/or the citation.  I’ll try to remember to add the citation here when (if!) the revised manuscript is accepted (Accepted! see link in this article).

The general strategy here is to:

1.  Shake your fist at the hard, cruel world that forced KEGG to go commercial.

2.  Create a DIAMOND database of all the coding sequences present in all completed Genbank genomes.

3.  Use DIAMOND blastx to search all the metagenomic reads against this database.

4.  Create an artificial Genbank format sequence record containing all the features that had a hit.  To avoid time and memory constraints this record will contain only unique BRENDA EC identifiers and gene products.

5.  Use pathway tools to predict the metabolic pathways in this artificial “genome”.

A word of warning… I hacked this together in a non-linear fashion, thus the code presented here is NOT exactly what I ran.  There might be some errors, so if you try this and it doesn’t run just take a deep breath and track down the errors.  Feel free to ping me if you can’t find the problem.  Okay, here we go!

Download the complete genomes from Genbank and go get a fresh cup of coffee:

wget -r -A *.gbk -nc ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/

Next, extract all the feature translations from these Genbank files (yes, you could just download the faa files, but I had reasons for needing to do it this way).  Pay attention to directory locations (here and in the other scripts)!

### user setable variables ###

ref_dir = '/volumes/hd1/ref_genome_database/ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/' # location of the database directory

### end user setable variables ###

import os  
from Bio import SeqIO  

with open('all_genomes.fasta', 'w') as fasta_out, open('all_genomes_map.txt', 'w') as map_out:    
    for d in os.listdir(ref_dir):    
        for f in os.listdir(ref_dir + d):
            if f.endswith('.gbk'):                
                gbk_name = f
                
                for record in SeqIO.parse(ref_dir + d + '/' + gbk_name, 'genbank'):
                    parent = record.seq
                    
                    for feature in record.features:
                        if feature.type == 'CDS':
                        
                            start = int(feature.location.start)
                            end = int(feature.location.end)
                            strand = feature.location.strand
                            try:
                                trans = feature.qualifiers['translation'][0]    
                                feature_id = (start, end, strand)
                                
                                if strand == -1:
                                    sequence = parent[start:end].reverse_complement()
                                else:
                                    sequence = parent[start:end]
                                
                                seqname = record.id + '_' + str(start) + '_' + str(end) + '_' + str(strand)
                                print start, end, sequence[0:10]
                                print >> fasta_out, '>' + seqname + '\n' + trans
                                print >> map_out, d, f, seqname
    
                            except KeyError:
                                continue

Make a DIAMOND database:

diamond makedb --in all_genomes.fasta -d all_genomes_diamond

And run DIAMOND blastx (so fast!).  This took me about an hour with 24 cpus and 32 Gb RAM for a metagenome with 22 million reads…

diamond blastx -d all_genomes_diamond -q test_mg.good.fasta -o test_mg.good.diamond.txt -k 1

Now create the Genbank file containing all features that have a hit.

ref_dir = 'ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/'

import os
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

map_dict = {}

with open('all_genomes_map.txt', 'r') as map_in:
    for line in map_in:
        line = line.rstrip()
        line = line.split()
        map_dict[line[2]] = line[0]

hits = set()
genomes = set()

with open('test_mg.good.diamond.txt', 'r') as diamond:
    for line in diamond:
        line = line.rstrip()
        line = line.split()
        evalue = float(line[10])
        hit = line[1]
        if evalue < 1e-5:
            hits.add(hit)
            print 'hit =', hit
            genome = map_dict[hit]
            genomes.add(genome)

with open('all_genomes_nr/genetic-elements.dat', 'w') as all_genetic_elements:
    print >> all_genetic_elements, 'ID' + '\t' + 'all_genomes_nr'
    print >> all_genetic_elements, 'NAME' + '\t' + 'all_genomes_nr'
    print >> all_genetic_elements, 'TYPE' + '\t' + ':CHRSM'
    print >> all_genetic_elements, 'CIRCULAR?' + '\t' + 'N'
    print >> all_genetic_elements, 'ANNOT-FILE' + '\t' + 'all_genomes_nr.gbk'
    print >> all_genetic_elements, '//'  
    
with open('all_genomes_nr/organism-params.dat', 'w') as all_organism_params:
    print >> all_organism_params, 'ID' + '\t' + 'all_genomes'
    print >> all_organism_params, 'Storage' + '\t' + 'File'
    print >> all_organism_params, 'Name' + '\t' + 'all_genomes'
    print >> all_organism_params, 'Rank' + '\t' + 'Strain'
    print >> all_organism_params, 'Domain' + '\t' + 'TAX-2'
    print >> all_organism_params, 'Create?' + '\t' + 't'

good_features = []
ec = set()
products = set()
    
for d in os.listdir(ref_dir):
    if d in genomes:
        for f in os.listdir(ref_dir + d):        
            for record in SeqIO.parse(ref_dir + d + '/' + f, 'genbank'):            
                for feature in record.features:
                    keep = False
                    if feature.type == 'CDS':
                        try:
                            temp_ec = feature.qualifiers['EC_number']
                            for each in temp_ec:
                                if each not in ec:
                                    keep = True
                                    ec.add(each)
                                    print feature.type, each
                            if keep == True:
                                good_features.append(feature)
                        except KeyError:
                            try:
                                temp_product = feature.qualifiers['product'][0]
                                if temp_product not in products:
                                    good_features.append(feature)
                                    products.add(temp_product)
                                    print feature.type, temp_product
                            except KeyError:
                                continue
            

new_record = SeqRecord(Seq('nnnn', alphabet = IUPAC.ambiguous_dna), id = 'all_genomes_nr', name = 'all_genomes_nr', features = good_features) 
SeqIO.write(new_record, open('all_genomes_nr/all_genomes_nr.gbk', 'w'), 'genbank')

Once you’ve waded your way through that mess you can run the Pathos program in Pathway-Tools from the command line as such:

pathway-tools -lisp -patho all_genomes_nr/ -disable-metadata-saving &> pathos_all_genomes_nr.log

That will take a little while to run.  When it’s complete you can navigate to where Pathway-Tools store your user-generated pathways (probably something like home/you/ptools-local/pgdbs/user).  The pathways will be in the all_genomes_nycyc directory.  The best way to get at that information is to parse the friendly pathway-reports.txt file.

Update 6/5/15 – I was stuck in file parsing mode when I wrote that last paragraph.  Clearly the best way to “get at that information” is to fire up Pathway-Tools, which will let you view all the pathways by hierarchy, and their associated graphics.

3917 Total Views 1 Views Today
This entry was posted in Research. Bookmark the permalink.

6 Responses to Predicting metabolic pathways in a metagenome

  1. Do you have any links to the files you produce here? More specifically, what is the format of the file that you give to pathway-tools?

    • Jeff Jeff says:

      Brandon,
      No links, and the format of the files are simply those specified by the pathway-tools module pathologic. To make this easier I’ve built the method into the paprica pipeline that I’ve been working on. Use the paprica-mg_build.py and paprica-mg_run.py scripts available from the development version of paprica on Github. Note that this requires you to build the paprica database first, which takes some dependencies and time. If you don’t want to do that you can deconstruct the paprica-mg.py script, but I’d recommend building the database. Be aware that at this time there is a bug in pathway-tools that prevents PGDBs being built in parallel. I think the pathway-tools people are working on this and hopefully there will be a fix out very soon. Until this is fixed it takes a very long time to build the paprica database.

Leave a Reply

Your email address will not be published. Required fields are marked *

Comments Protected by WP-SpamShield for WordPress