A short tutorial on GNU Parallel

Luke Piszkin — Wed, 20 Jan 2021 05:36:12 +0000

This post comes from Luke Piszkin, an undergraduate researcher in the Bowman Lab. GNU Parallel is a must-have utility for anyone who spends a lot of time in Linux Land, and Luke recently had to gain some GNU Parallel fluency for his project. Enjoy!

*******

GNU parallel is a Linux shell tool for executing jobs in parallel using multiple CPU cores. This is a quick tutorial for speeding up your workflow and getting the most out of your machine with parallel. You can find the current distribution here: https://www.gnu.org/software/parallel/. After installing it, please try some basic commands to make sure it is working.

You will need some basic understanding of “piping” in the command line. I will describe command pipes briefly for our purposes, but for a more detailed look please see https://www.howtogeek.com/438882/how-to-use-pipes-on-linux/. Piping data in the command line involves taking the output of one command and using it as the input for another. A basic example looks like this:

command_1 | command_2 | command_3 | …

Here the output of command_1 is used as the input of command_2, the output of command_2 is used as the input of command_3, and so on. For now, we will only need one pipe with parallel. Now let’s look at a basic command run in parallel.
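Before that, a concrete instance of the pattern may help. The following is a minimal sketch using only standard tools; the fruit names are made-up sample data, not anything from my project:

```shell
# Four commands joined by three pipes:
#   printf emits four lines, sort puts duplicates next to each other,
#   uniq drops the repeats, and wc -l counts the lines that are left.
unique_count=$(printf 'pear\napple\npear\nfig\n' | sort | uniq | wc -l)

# $(( )) strips any padding that some versions of wc add around the number.
echo "unique lines: $((unique_count))"
```

The parallel command that follows has the same shape: one command produces lines of text and the next one consumes them.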

Input: find -type f -name "*.txt" | parallel cat

Output: 
The house stood on a slight rise just on the edge of the village.
It stood on its own and looked over a broad spread of West Country farmland.
Not a remarkable house by any means - it was about thirty years old, squattish, squarish, made of brick, and had four windows set in the front of a size and proportion which more or less exactly failed to please the eye
The only person for whom the house was in any way special was Arthur Dent, and that was only because it happened to be the one he lived in.
He had lived in it for about three years, ever since he had moved out of London because it made him nervous and irritable

This command uses find to list all the .txt files in my directory, then runs cat on them in parallel, which prints the contents of each file on its own line. We can already see how this is much easier than running each command separately, i.e.:

In: cat file1.txt

The house stood on a slight rise just on the edge of the village.

In: cat file2.txt

It stood on its own and looked over a broad spread of West Country farmland.
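One caveat about the parallel version: each job's output is printed when that job finishes, so the lines are not guaranteed to come out in the same order as the input files. If order matters, the -k (--keep-order) option asks parallel to print results in input order. Here is a minimal sketch, where sleep stands in for jobs of different lengths and the numbers are made-up inputs. It assumes a sleep that accepts fractional seconds, as GNU sleep does:

```shell
# Run only if GNU parallel is installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # Each job sleeps for the given time and then echoes it, so the 0.1 job
    # finishes first. With -k the output still follows the input order.
    ordered=$(printf '0.3\n0.1\n0.2\n' | parallel -k 'sleep {}; echo {}')
    echo "$ordered"
fi
```

Dropping -k from the same command will most likely print 0.1, 0.2, 0.3 instead, since the shortest job finishes first.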

Also, notice how we do not need any placeholder for the files in the second command: when the command contains no placeholder, parallel appends each piped input line to the end of it. Now let’s take a more complicated example:

find -type f -name "*beta_gal_vibrio_vulnificus_1_100000_0__H_flex=up_*.txt" ! -name "*tally*" | parallel -j 4 python3 PEPCplots.py {} flex log

0.001759374417007663, 0.00033497120199255527, 0.9969940359705531
0.0019773468515624356, 0.00022978867370935437, 0.9969940359705531
0.001332602651915014, 0.0005953339816183529, 0.9969940359705531
0.0015118302435556904, 0.0005040931537659636, 0.9969940359705531
0.001320879258211107, 0.0006907926578169569, 0.9969940359705531
0.0016753759966792244, 0.00041583739269117386, 0.9969940359705302
0.0017187095827331082, 0.00036931151058880094, 0.9969940359705531
0.0017045099726521733, 0.00031386214441070197, 0.9969940359705531
0.001399703145023273, 0.0005196629341168314, 0.9969940359705531
0.001436129272321403, 0.0004806654291442482, 0.9969940359705531

This is an example from my research. It takes in a .txt data file and prints some parameters that I want to put in a spreadsheet. Like before, we use find to get a list of all the files we want the second command to process. We use ! -name "*tally*" to exclude any files that have “tally” anywhere in the name because we don’t want to process those.

In the second command, we have the option -j 4. This tells parallel to run up to 4 jobs at a time, one per CPU core. You can check your computer specs to see how many cores you have available. If your machine has hyper-threading, each physical core appears as two logical cores, and parallel can run jobs on those too. For instance, my dinky laptop only has 2 cores, but with hyper-threading I can use 4. This is another way to improve your efficiency.

In the second command you also see a {} placeholder. This spot is filled by each line that the first command outputs, one line per job. In this case, we need the placeholder because the input file goes between other arguments.

You can also use parallel to run a number of identical commands at the same time. This is helpful if you have a program to run on the same file multiple times. For example:

seq 10 | parallel -N0 cat file1.txt

The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.

Here we use seq as a counting mechanism for how many times to run the second command. You can adjust the number of jobs by changing the seq argument. We include the -N0 flag, which tells parallel to run one job per input line without inserting that line into the command, because we aren’t using the first command for inputs this time. Often, I like to include both the time shell tool and the --progress parallel option to see current job status and time to completion:

seq 10 | time parallel --progress -N0 cat file1.txt

Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:4/0/100%/0.0s The house stood on a slight rise just on the edge of the village.
local:4/1/100%/1.0s The house stood on a slight rise just on the edge of the village.
local:4/2/100%/0.5s The house stood on a slight rise just on the edge of the village.
local:4/3/100%/0.3s The house stood on a slight rise just on the edge of the village.
local:4/4/100%/0.2s The house stood on a slight rise just on the edge of the village.
local:4/5/100%/0.2s The house stood on a slight rise just on the edge of the village.
local:4/6/100%/0.2s The house stood on a slight rise just on the edge of the village.
local:3/7/100%/0.1s The house stood on a slight rise just on the edge of the village.
local:2/8/100%/0.1s The house stood on a slight rise just on the edge of the village.
local:1/9/100%/0.1s The house stood on a slight rise just on the edge of the village.
local:0/10/100%/0.1s
0.21user 0.46system 0:00.63elapsed 108%CPU (0avgtext+0avgdata 15636maxresident)k
0inputs+0outputs (0major+12089minor)pagefaults 0swaps
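Before launching a big batch, it can help to preview exactly what parallel is going to run. The --dry-run option prints each command instead of executing it. The sketch below uses it to show both ways of passing arguments covered above. The file names are hypothetical, and nproc (from GNU coreutils) is assumed to be available to report the number of cores:

```shell
# Run only if GNU parallel is installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # No placeholder: each input line is appended to the end of the command.
    implicit=$(printf 'file1.txt\nfile2.txt\n' | parallel -k --dry-run cat)

    # Explicit {} in the middle of the command, with as many job slots
    # as nproc reports cores. Nothing is executed, so PEPCplots.py is not needed.
    explicit=$(printf 'file1.txt\nfile2.txt\n' |
        parallel -k --dry-run -j "$(nproc)" python3 PEPCplots.py {} flex log)

    printf '%s\n' "$implicit" "$explicit"
fi
```

This should list four commands, two cat calls and two python3 calls, with each file name substituted where it belongs. Once the list looks right, removing --dry-run runs the jobs for real.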

And with that, you are well on your way to significantly increasing your computing throughput and using the full potential of your machine. You should now have a sufficient understanding of parallel to construct a command for your own projects, and to explore more complicated applications of parallelization. (Bonus points to whoever knows the book that I used for the text files.)

Luke Piszkin – The Bowman Lab
