Parallel wget with xargs or parallel
The following is an illustration of how to use xargs to conduct parallel operations on single-threaded applications, specifically wget.
GNU wget is a great tool for downloading content from websites. The wget command is a non-interactive network downloader; "non-interactive" means it can run in the background without user input. Some very handy options include -c (continue, for partially downloaded files), -m (mirror, for an entire website), and -r --no-parent (recursive, no parent, to download part of a website and its subdirectories). The cURL application supports a wider range of protocols and includes upload options, but is non-recursive.
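For instance, a few illustrative invocations (example.org and the filename are placeholders, not real targets):
wget -c https://example.org/large-file.iso      # resume a partially downloaded file
wget -m https://example.org/                    # mirror the entire site
wget -r --no-parent https://example.org/docs/   # recurse, but stay below /docs/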
Recently, I had the need to download a small number of PDF files. The wildcard-based approach would be:
$ wget -r -nd --no-parent -A 'rpgreview_*.pdf' http://rpgreview.net/files/
The -r and --no-parent options have already been explained. The -nd option saves all files to the current directory, without recreating the hierarchy of directories. The -A option ('accept', or -R for 'reject') takes a comma-separated list of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters *, ?, [ or ] appear in an element of the acclist or rejlist, that element is treated as a pattern rather than a suffix.
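For the converse, the same command with -R would fetch everything in that directory except the PDFs; this is shown purely to illustrate the reject syntax and was not timed:
$ wget -r -nd --no-parent -R 'rpgreview_*.pdf' http://rpgreview.net/files/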
Running the above gives the following time:
real 2m19.353s
user 0m0.836s
sys 0m2.998s
An alternative, looping through each file one at a time, would have been something like:
for issue in {1..53}
do
wget "https://rpgreview.net/files/rpgreview_$issue.pdf"
done
(Just for the record, wget can get a bit gnarly when dealing with HTTP requests because some webservers have no requirement that URL path delimiters match directory delimiters. For the purposes of this discussion it is assumed that we're dealing with a rational setup where the two are equivalent.)
Using a combination of the printf command and xargs, a list of URLs can be constructed and passed to xargs, which splits the list and runs the downloads in parallel.
By itself, xargs simply reads items from standard input, delimited by blanks or newlines, and executes a command with those items as arguments. This is somewhat different to a pipe which, by itself, sends the output of one command as the input stream of another. In contrast, xargs takes data from standard input and executes a command with that data appended to the end of the command as arguments. The data can also be inserted anywhere in the command using a placeholder for the input; the typical placeholder is {}.
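A quick, safe illustration of the placeholder behaviour (it only echoes the items, so it can be run anywhere):
printf '%s\n' one two three | xargs -I {} echo "processing {}"
Each input line replaces {} in the command, so echo is run three times, once per item.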
The value -P 8 is entirely arbitrary here and should be modified according to available resources. Adding -nc (no-clobber) to wget prevents the same file from being downloaded more than once (without it, wget will not overwrite an existing file, but will instead save a new copy with a .1 suffix, and so on). The -n 1 option to xargs ensures that only one argument is passed to each wget process.
printf "https://rpgreview.net/files/rpgreview_%d.pdf\n" {1..53} | xargs -n 1 -P 8 wget -q -nc
The time of the above comes to:
real 1m23.534s
user 0m1.567s
sys 0m2.732s
Yet another choice is to use GNU parallel and seq.
seq 53 | parallel -j8 wget "https://rpgreview.net/files/rpgreview_{}.pdf"
real 1m57.647s
user 0m1.830s
sys 0m4.214s
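For completeness, GNU parallel can also consume the printf-generated URL list directly rather than building the URL from seq output; this variant was not timed here:
printf "https://rpgreview.net/files/rpgreview_%d.pdf\n" {1..53} | parallel -j8 wget -q -nc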
A final option, most common on high-performance computing systems with job schedulers, is to make use of a job array. Assuming resources are available, this is a very effective option when each task in the array takes more than a couple of minutes (given the overhead involved in constructing the job, submitting it to the queue, and so on). In Slurm, a script with the directives and code would look like the following:
#!/bin/bash
#SBATCH --job-name="file-array"
#SBATCH --ntasks=1
#SBATCH --time=0-00:15:00
#SBATCH --array=1-53
wget "https://rpgreview.net/files/rpgreview_${SLURM_ARRAY_TASK_ID}.pdf"