Top 10 scripting tricks for basic bioinformatics

In the process of preparing some data for publication I needed to revisit some scripts I wrote several months ago, clean them up, and re-run them. I was pretty amazed with how bad they were, and ended up spending a couple of days rewriting them (necessary? No… but it felt good). Many of the changes between the old versions and the new version came down to the implementation of a relatively small number of tools. Here are the top ten elements present in scripts I write now that have resulted in huge performance improvements over scripts I wrote, say, a year ago. I can only hope the pace of improvement continues. The elements are a mix of Python and shell commands or data structures. If you’re an experienced hand at writing scripts these are pretty basic tips, but if you’re just starting out you might find them useful. If you’ve got a life-saving trick of your own leave a comment.

Number 10: the tab delimited format

I’ll be honest. The xml format scares me, but not as bad as the newick format and some of the other more obscure “standard” data formats that crop up in bioinformatics. Perhaps with more experience I’ll get over this, but for now nothing makes me happier than the tab-delimited format. Easy to read, easy to parse. Even some formats developed by people sophisticated enough to come up with something more complex are tab-delimited (the SAM format, for example). The three tools I use the most on a day to day basis are bash, Python, and R. With everything tab-delimited they all play nice together. I try to keep as much data in this format as possible.

Number 9: head and tail

Not a scripting technique per se, and one of the first shell commands most people learn. Have a 5 G tab-delimited file of unknown columns? Use head -X to quickly access the first X number of lines. Need a small test file for your next script? Redirect that same head command to a file and you’re all set. Same applies to tail, except it returns the bottom X lines.

Number 8: the redirect

Also one of the first shell techniques most people learn. Redirects allow you to send files to shell commands and capture output. For example if you want a file of all the sequence names in a very large fasta, grep and a redirect will do the trick:

fgrep '>' big.fasta > big.fasta.names.txt

Related to the redirect is the pipe. You can send the output of one command straight to another with the “|” character. For example if you need to concatenate thousands of fasta files and then compress them:

cat *.fasta | gzip

This eliminates the large, uncompressed intermediate file.

Number 7: dictionaries

I didn’t really get dictionaries when I took a class on Python as a new graduate student. I understood what they did, but not how that was particularly useful. In the last script I wrote I think I implemented four different dictionaries, mapping gi number to ncbi id, gi to taxonomic id, ncbi id to taxonomy, and finally actual sequence information to taxonomy (or something along those lines). This multi-step mapping would have been extremely complex and inefficient using a different data structure. One conceptual breakthrough came when I realized dictionary values could be anything – lists, sets, even other dictionaries. It’s a great way to organize and quickly access a lot of information.

Number 6: parallel

Parallel computing baffles me a little. I have about a 50 % success rate implementing parallel cluster computing on various projects using MPI, and a 0 % success rate implementing the Python parallel computing modules. The gnu “parallel” command is a cozy warm blanket in the cold hard world of parallel computing. If you have multiple cpus on the same motherboard (for example our lab’s MacPro workstation) and a job you can easily split into many small jobs, parallel allows simultaneous execution of as many jobs as you have cores. It works great on loops in bash shells and also on multiple commands in a bash shell script. I’ve found the simplest way to execute parallel is to use Python to split my job and create a bash script with one line per job (each line might contain multiple commands). Redirecting the script into parallel results in parallel execution. Parallel is smart enough to capture and sort output from all the threads, even if its going to the same place.

parallel < my_awesome_script.sh > my_dissertation.txt

Number 5: grep

Another basic shell command. Grep, fgrep, and zgrep are magic commands that can extract critical lines of data based on pattern matching. If you aren’t using a nested file format (if you are using tab-delimited or similar) the grep commands can quickly subset a massive file to just those lines you need. It’s worth looking at all the options available for these commands, which make them quite powerful.

Number 4: “vectorized” commands

I’m not sure if “vectorized commands” is the right term, but I’ve heard that applied to R which discourages the use of loops and relies very heavily on commands that iterate intrinsically. Python tends to rely more heavily on loops, but it has a number of these intrinsic iterators that are much, much faster than loops. For example to match a string to a list of strings you might be tempted to use a loop:

for s in some_list:
     if my_string == s:
          do something

However the “in” command will perform much faster:

if my_string in some_list:
     do something

Number 3: gzip

Bioinformatics is big files. You can’t escape from them. The gigabytes keep growing until they’re terabytes, and then I don’t really want to think about what happens. And of course you have to move all these files from computer to computer and try to do things with them. Python (using the gzip module) and R (natively) both play well with gzip. So does zgrep. I try to keep everything larger than a couple of megabytes in the gzip format. This saves a ton of space and of course file transfers are faster.

Number 2: with

With is another Python command that I didn’t really get at first. Then came the day that I needed to open a many gigabyte file in Python and couldn’t do it. The “with open(…) as handle:” command opens the file for reading without holding the whole thing in memory. Add an iterator to loop across the lines. Downside, it adds one more indentation level to that block of your code. Upside, you can add multiple “open as” segments to the with line:

with open('file_1.txt', 'r') as file1, \
open('file_2.txt.gz', 'rb') as file2, \
open('output.txt', 'w') as output:
     for line in file1:
          do lots of stuff

Number 1: the set

The single most useful scripting trick I’ve learned over the last year is the set. Sets are containers of unique, unordered elements. Searching across a set is lightning fast compared to searching a across a list. Sets convert to and from ordered lists easily enough if ordering is important and duplicate entries are not. I almost never search across Python lists anymore. Sets have allowed me to create some scripts that will run overnight that would have taken weeks or months with lists.

Happy scripting!

20637 Total Views 10 Views Today

Top 10 scripting tricks for basic bioinformatics

Related

2 Responses to Top 10 scripting tricks for basic bioinformatics

Leave a Reply Cancel reply

Post categories

Most recent posts

Copyright

Disclaimer

Meta