[Published in Open Source For You (OSFY) magazine, October 2013 edition.]

GNU Parallel is a tool for running jobs in parallel in a Bash environment. A job can be a single command or a script, with varying arguments. Jobs can also be executed simultaneously on remote machines. GNU Parallel is released under the GPLv3+ license, and you can install it on Fedora using the following command:

$ sudo yum install parallel

After installation, remove ‘--tollef’ from the /etc/parallel/config file if it is present. This option will be permanently removed in future releases.

GNU Parallel takes a command and a list of arguments for processing. The arguments are provided on the command line after the ‘:::’ separator, and the command is executed once for each argument. For example:

$ parallel echo ::: alpha beta gamma
alpha
beta
gamma

You can pass multiple input sources to GNU Parallel, each introduced by its own ‘:::’, and it will run the command for every combination of the inputs, as shown below:

$ parallel echo ::: 0 1 ::: 0 1
0 0
0 1
1 0
1 1
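
When there is more than one input source, each source can also be referenced by position with ‘{1}’, ‘{2}’ and so on. The following is a minimal sketch; the letters and digits are arbitrary sample values, and ‘-k’ is used only to make the output order reproducible:

```shell
# Two input sources: {1} is taken from the first, {2} from the second.
# Each output line shows the second value first, then the first value.
parallel -k echo {2}-{1} ::: a b ::: 1 2
# 1-a
# 2-a
# 1-b
# 2-b
```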

The order of the output lines may differ from run to run. The tool provides a number of replacement strings. The default string ‘{}’ represents the input:

$ parallel echo {} ::: /tmp
/tmp

The replacement string ‘{/}’ removes everything up to and including the last forward slash:

$ parallel echo {/} ::: /tmp/stdio.h
stdio.h

If you want only the directory part of the path, use the ‘{//}’ string:

$ parallel echo {//} ::: /tmp/stdio.h
/tmp

The string ‘{.}’ removes any filename extension:

$ parallel echo {.} ::: /tmp/stdio.h
/tmp/stdio
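
The replacement strings can be combined in a single command, and ‘{/.}’ applies both transformations at once (the basename without its extension). A small sketch using the same sample path as above:

```shell
# Directory, basename, path without extension, basename without extension.
parallel echo {//} {/} {.} {/.} ::: /tmp/stdio.h
# /tmp stdio.h /tmp/stdio stdio
```

The file does not need to exist, since the strings are computed from the argument text alone.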

The output of a GNU Parallel command may not necessarily be in the order in which the input arguments are listed. For example:

$ parallel sleep {}\; echo {} ::: 5 2 1 4 3
1
2
4
3
5

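If you want the output to appear in the same order as the input, use the ‘-k’ option. The jobs still run in parallel; only the order in which their output is printed is fixed, as shown below: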

$ parallel -k sleep {}\; echo {} ::: 5 2 1 4 3
5
2
1
4
3
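
One way to check that ‘-k’ affects only the printed order, and not the scheduling, is to time a run. This is a rough sketch: the sleep values are arbitrary, ‘-j3’ is added so that all three jobs get a slot regardless of the core count, and the elapsed time will vary from machine to machine:

```shell
# Three jobs, three job slots. If the jobs ran one after another the run
# would take about 4 seconds (2 + 1 + 1); run in parallel it should take
# roughly as long as the longest sleep.
start=$(date +%s)
parallel -j3 -k 'sleep {}; echo {}' ::: 2 1 1
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

The three values are printed in input order (2, 1, 1) even though the shorter jobs finish first.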

A test script, for example, may need to be run ‘N’ times for the same argument. This can be accomplished with the following code:

$ seq 10 | parallel -n0 echo "Hello, World"
Hello, World
Hello, World
Hello, World
Hello, World
Hello, World
Hello, World
Hello, World
Hello, World
Hello, World
Hello, World

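The ‘-n’ option sets the maximum number of arguments placed on each command line. With ‘-n0’, GNU Parallel reads one argument per job but inserts none of them, so the command simply runs once for every input line.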
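
For comparison, a value greater than zero groups the input. In this short sketch with arbitrary sample arguments, ‘-n2’ hands two arguments to each job:

```shell
# Four arguments, two per command line, so echo runs twice.
parallel -k -n2 echo ::: a b c d
# a b
# c d
```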

The commands that GNU Parallel will execute can be inspected with the ‘--dry-run’ option, as illustrated below:

$ parallel --dry-run -k sleep {}\; echo {} ::: 5 2 1 4 3
sleep 5; echo 5
sleep 2; echo 2
sleep 1; echo 1
sleep 4; echo 4
sleep 3; echo 3

The ‘--eta’ option prints an estimate of the time remaining until all the jobs are complete:

$ parallel --eta -k sleep {}\; echo {} ::: 5 2 1 4 3

Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 5s 1left 1500avg  local:1/4/100%/1.0s 
2
1
4
3
ETA: 1s 0left 1.00avg  local:0/5/100%/1.0s 

Suppose you have a large number of log files that you wish to compress and archive. You can run the gzip command in parallel, as shown below:

$ parallel gzip ::: *.log

To unzip them all, you can use the following command:

$ parallel gunzip ::: *.gz
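
If a glob matches a very large number of files, the shell may refuse to run the command with an "Argument list too long" error. Piping the names from find avoids that limit, and the NUL-separated form (‘-print0’ with ‘-0’) also copes with unusual file names. The sketch below works in a temporary directory with two made-up file names, so no real logs are touched:

```shell
# Throwaway directory with two hypothetical log files.
dir=$(mktemp -d)
touch "$dir/app.log" "$dir/web server.log"   # second name contains a space

# find emits NUL-terminated names; parallel -0 splits its input on NUL.
find "$dir" -name '*.log' -print0 | parallel -0 gzip

ls "$dir"        # both files now carry a .gz suffix
rm -rf "$dir"
```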

The ‘convert’ command from ImageMagick is useful for transforming image files. High-resolution images can be scaled to a lower resolution using the following command:

$ convert -resize 512x384 file.jpg file_web.jpg

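If you have a large number of files that you wish to resize, you can parallelize the task. The ‘{.}’ string strips the extension, so that file.jpg is written to file_web.jpg, as in the single-file command above:

$ find . -name '*.jpg' | parallel convert -resize 512x384 {} {.}_web.jpg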
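
Before launching a batch job like this one, it can be worth previewing the generated output names with ‘--dry-run’. The sketch below feeds two hypothetical file names on standard input, so neither ImageMagick nor any image file is needed; ‘{.}’ is the input with its extension removed:

```shell
# Print the convert commands that would run, without executing them.
printf '%s\n' ./a.jpg ./b.jpg |
  parallel -k --dry-run convert -resize 512x384 {} {.}_web.jpg
# convert -resize 512x384 ./a.jpg ./a_web.jpg
# convert -resize 512x384 ./b.jpg ./b_web.jpg
```

Using plain ‘{}’ for the output name would instead give names such as ./a.jpg_web.jpg.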

GNU Parallel combined with wget can download several large Linux kernel releases in parallel, as shown below:

$ parallel wget ::: www.kernel.org/pub/linux/kernel/v3.x/linux-3.11.tar.xz \
                    www.kernel.org/pub/linux/kernel/v3.x/linux-3.10.10.tar.xz

The URLs can also be stored in a text file (“input.txt”), and passed to GNU Parallel with the ‘-a’ option:

$ parallel -a input.txt wget

The file “input.txt” contains:

https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.11.tar.xz
https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.10.10.tar.xz
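
The same list can also be supplied after ‘::::’ or on standard input. In the sketch below the URLs are placeholders and echo stands in for wget, so nothing is actually downloaded:

```shell
# Hypothetical URL list written to a temporary file.
list=$(mktemp)
printf '%s\n' https://example.org/a.tar.xz https://example.org/b.tar.xz > "$list"

parallel -k echo wget :::: "$list"   # file named after ::::
parallel -k echo wget < "$list"      # same list on standard input

rm -f "$list"
```

Both invocations print one "wget URL" line per entry in the file.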

The downloaded kernel tarballs can also be extracted in parallel:

$ find . -name \*.tar.xz | parallel tar xvf

A ‘for’ loop in a Bash script can be parallelized. In the following script, the file sizes of all the text files are printed:

#!/bin/sh

# Let the shell expand the glob directly instead of parsing the output of ls.
for file in *.txt; do
  ls -lh "$file"
done | cut -d' ' -f 5

The parallelized version is as follows:

$ ls *.txt | parallel "ls -lh {}" | cut -d' ' -f 5
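
A variant that avoids parsing the output of ls altogether is to hand the glob to GNU Parallel after ‘:::’ and ask wc for the size. This is only a sketch: it reports sizes in bytes rather than in the human-readable form of ‘ls -lh’, and the two sample files are created in a temporary directory for illustration:

```shell
# Two hypothetical text files of known size.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/a.txt"   # 6 bytes
printf 'hi\n'    > "$dir/b.txt"   # 3 bytes

# The shell expands the glob; each file becomes one job.
# -k keeps the sizes in the same order as the file names.
parallel -k 'wc -c < {}' ::: "$dir"/*.txt

rm -rf "$dir"
```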

The number of CPUs and cores in your system can be listed with GNU Parallel:

$ parallel --number-of-cpus
1
$ parallel --number-of-cores
4

The ‘-j’ option specifies the number of jobs to be run in parallel. If the value 0 is given, GNU Parallel will try to start as many jobs as possible. Writing ‘-j+N’ runs N more jobs than the number of CPU cores. For example:

$ find . -type f -print | parallel -j+2 ls -l {}
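
At the other extreme, ‘-j1’ gives a single job slot, which can be handy when debugging a command line. A minimal sketch with arbitrary sleep values:

```shell
# One job slot: each job must finish before the next one starts,
# so the output follows the input order even without -k.
parallel -j1 'sleep {}; echo {}' ::: 1 0
# 1
# 0
```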

The input to GNU Parallel can also be provided in a tabular format. Suppose you want to run ping tests for different machines. You can use a text file in which the first column gives the ping count, and the second column lists the hostname or the IP address. For example:

$ cat hosts.txt 
1 127.0.0.1
2 localhost

You can run the tests in parallel using the following code:

$ parallel -a hosts.txt --colsep ' ' ping -c {1} {2}

PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.074 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.074/0.074/0.074/0.000 ms

PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.035 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=2 ttl=64 time=0.065 ms

--- localhost.localdomain ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.035/0.050/0.065/0.015 ms
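
The separator given to ‘--colsep’ does not have to be a space. The sketch below assumes a comma-separated variant of the same table, supplied on standard input, and uses echo to print the ping commands rather than run them, so no packets are sent:

```shell
# {1} is the first column (count), {2} the second (host).
printf '1,127.0.0.1\n2,localhost\n' |
  parallel -k --colsep ',' echo ping -c {1} {2}
# ping -c 1 127.0.0.1
# ping -c 2 localhost
```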

GNU Parallel can also execute jobs on remote machines, for which you need to first test that ssh works:

$ SERVER1=localhost
$ ssh $SERVER1 echo "Eureka"
guest@localhost's password: 
Eureka

You can then invoke commands or scripts to be run on SERVER1, as shown below:

$ parallel -S $SERVER1 echo "Eureka from " ::: $SERVER1
guest@localhost's password: 
Eureka from localhost

Files can also be transferred to remote machines using the ‘--transfer’ option. Rsync is used internally for the transfer. An example is shown below:

$ parallel -S $SERVER1 --transfer cat ::: /tmp/host.txt
guest@localhost's password: 
1 127.0.0.1
2 localhost

Refer to the GNU Parallel tutorial and manual page for more options and examples.