Using Parallel Processing for Text File Processing (and Shell Scripts)

12 February 2008

Over at Onkar Joshi’s Blog, he wrote an article about how to write a shell script to process a text file using parallel processing. He provided an example script (provided here with my commentary in the comments):

# Split the file up into parts
# of 15,000 lines each
split -l 15000 originalFile.txt
#
# Process each part in separate processes -
# which will run on separate processors
# or CPU cores
for f in x*
do
runDataProcessor $f > $f.out &
done
#
# Now wait for all of the child processes
# to complete
wait
#
# All processing completed:
# build the text file back together
for k in *.out
do
cat $k >> combinedResult.txt
done

The commentary should be fairly complete. The main trick is to split the file into independent sections to be processed, then to process the sections independently. Not all files can be processed this way, but many can.

When the file is split up in this way, then multiple processes can be started to process the parts - and thus, separate processes will be put onto separate threads and the entire process will run faster. Other wise, with a single process, the entire process would only utilize one core, with no benefit of multiple cores.

The command split may already be on your system; HP-UX has it, and Linux has everything.

This combination could potentially be used for more things: the splitting of the task into parts (separate processes) then waiting for the end result and combining things back together as necessary. Remember that each process may get a separate processor, and that only if the operating system is supporting multiple threads will this work. Linux without SMP support will not work for example.

I did something like this in a Ruby script I wrote to rsync some massive files from a remote host - but in that case, there were no multiple cores (darn). I spawned multiple rsync commands within Ruby and tracked them with the Ruby script. It was mainly the network that was the bottleneck there, but it did speed things up some - with multiple cores and a faster CPU, who knows?

Also, in these days, most every scripting language has threads (which could potentially be run on multiple CPUs). I’ll see if I can’t put something together about threading in Ruby or Perl one of these days.

Entry Filed under: Perl, Ruby, Scripting, Tips, UNIX. Tags: , , , , , , .

Leave a Comment

Required

Required, hidden

Some HTML allowed:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Trackback this post  |  Subscribe to the comments via RSS Feed


David Douthitt

David is an experienced UNIX and Linux system administrator, a former Linux distribution maintainer, and author of two books ("Advanced Topics in System Administration" and "GNU Screen: A Comprehensive Manual"). View David Douthitt's profile on LinkedIn

Recent Posts

Top Posts

RSS Sharky's Column!

Calendar

February 2008
M T W T F S S
« Jan   Mar »
 123
45678910
11121314151617
18192021222324
2526272829  

Recent Comments

bharat on The Demise of the HP-UX System…
H4mm3r on Avoiding catastrophe!
Vladimir on Argument list too long?
ddouthitt on The UNIX find command and…
Mihir G joshi on The UNIX find command and…

Category Cloud

BSD Career Debian Debugging Fedora FreeBSD HPUX Learning Linux MacOS X Mind Hacks Mobile Computing NetBSD Networking OpenBSD OpenSolaris Open Source OpenVMS Personal Notes Portable Presentations Red Hat Scripting Security Solaris Tips Ubuntu UNIX Wheel Group Windows

Archives

Feeds

Links