Over at Onkar Joshi’s Blog, he wrote an article about how to write a shell script to process a text file using parallel processing. He provided an example script (provided here with my commentary in the comments):
# Split the file up into parts
# of 15,000 lines each
split -l 15000 originalFile.txt
# Process each part in separate processes -
# which will run on separate processors
# or CPU cores
for f in x*
do
    runDataProcessor $f > $f.out &
done
# Now wait for all of the child processes
# to complete
wait
# All processing completed:
# build the text file back together
for k in *.out
do
    cat $k >> combinedResult.txt
done
The commentary should be fairly complete. The main trick is to split the file into independent sections to be processed, then to process the sections independently. Not all files can be processed this way, but many can.
When the file is split up this way, multiple processes can be started to handle the parts, and the operating system will schedule those separate processes onto separate cores, so the whole job finishes faster. Otherwise, a single process would utilize only one core and gain no benefit from the others.
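A minimal sketch of the background-and-wait behavior the script relies on, assuming a POSIX shell (the job bodies here are just placeholders, not from the original article):

```shell
# Start three jobs in the background; each appends a line when it finishes
log=$(mktemp)
for n in 1 2 3
do
    ( echo "job $n done" >> "$log" ) &
done
# wait blocks until every background child of this shell has exited
wait
cat "$log"
```

The order of the lines in the log is not guaranteed, but after wait returns, all three are there.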
The command split may already be on your system; HP-UX has it, and Linux has everything.
This combination could potentially be used for more things: splitting a task into parts (separate processes), waiting for the results, and combining things back together as necessary. Remember that each process may get a separate processor, but only if the operating system supports multiple processors; a Linux kernel built without SMP support, for example, will not spread the processes across cores.
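Here is a small, self-contained run of the same split/process/wait/combine pattern, assuming a POSIX shell. Since runDataProcessor is specific to the original article, tr stands in for it as a hypothetical processor, and the chunk size is shrunk to two lines so the whole thing fits in a toy example:

```shell
# Work in a scratch directory so the x* globs only see our own files
workdir=$(mktemp -d)
cd "$workdir"
# A six-line input file
printf 'a\nb\nc\nd\ne\nf\n' > originalFile.txt
# Split into two-line chunks: xaa, xab, xac
split -l 2 originalFile.txt
# Process each chunk in the background
# (tr is a hypothetical stand-in for runDataProcessor)
for f in x*
do
    tr '[:lower:]' '[:upper:]' < "$f" > "$f.out" &
done
wait
# Stitch the results back together; the sorted glob keeps the order
for k in x*.out
do
    cat "$k" >> combinedResult.txt
done
cat combinedResult.txt
```

combinedResult.txt ends up with the six uppercased lines in their original order, because the shell expands x*.out in sorted order, matching the order split wrote the chunks.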
I did something like this in a Ruby script I wrote to rsync some massive files from a remote host – but in that case, there were no multiple cores (darn). I spawned multiple rsync commands within Ruby and tracked them with the Ruby script. It was mainly the network that was the bottleneck there, but it did speed things up some – with multiple cores and a faster CPU, who knows?
Also, in these days, most every scripting language has threads (which could potentially be run on multiple CPUs). I’ll see if I can’t put something together about threading in Ruby or Perl one of these days.
UPDATE: Fixed links to article and blog (thanks for the tip, Onkar!).
3 thoughts on “Using Parallel Processing for Text File Processing (and Shell Scripts)”
Consider having a look at Parallel https://savannah.nongnu.org/projects/parallel/ It makes the script more readable.
split -l 15000 originalFile.txt
ls x* | parallel runDataProcessor >> combinedResult.txt
If the order of the input needs to be kept in the output:
ls x* | parallel -k runDataProcessor >> combinedResult.txt
If runDataProcessor is CPU intensive, run one for each core:
ls x* | parallel -j+0 runDataProcessor >> combinedResult.txt
I moved to my own domain. Would be very nice if you could change the two links in the article above to the new one.
The new GNU Parallel has a function made for this kind of task so you can avoid the temporary files completely. It only requires that runDataProcessor can read from stdin (standard input):
cat originalFile.txt | parallel -k --pipe --block 10M runDataProcessor > combinedResult.txt
To learn more watch the video about --pipe: http://www.youtube.com/watch?v=1ntxT-47VPA