Using Parallel Processing for Text File Processing (and Shell Scripts)

12 February 2008

Over at Onkar Joshi’s Blog, he wrote an article about how to write a shell script to process a text file using parallel processing. He provided an example script (provided here with my commentary in the comments):

# Split the file up into parts
# of 15,000 lines each
split -l 15000 originalFile.txt
#
# Process each part in separate processes -
# which will run on separate processors
# or CPU cores
for f in x*
do
runDataProcessor $f > $f.out &
done
#
# Now wait for all of the child processes
# to complete
wait
#
# All processing completed:
# build the text file back together
for k in *.out
do
cat $k >> combinedResult.txt
done

The commentary should be fairly complete. The main trick is to split the file into independent sections to be processed, then to process the sections independently. Not all files can be processed this way, but many can.

When the file is split up in this way, then multiple processes can be started to process the parts – and thus, separate processes will be put onto separate threads and the entire process will run faster. Other wise, with a single process, the entire process would only utilize one core, with no benefit of multiple cores.

The command split may already be on your system; HP-UX has it, and Linux has everything.

This combination could potentially be used for more things: the splitting of the task into parts (separate processes) then waiting for the end result and combining things back together as necessary. Remember that each process may get a separate processor, and that only if the operating system is supporting multiple threads will this work. Linux without SMP support will not work for example.

I did something like this in a Ruby script I wrote to rsync some massive files from a remote host – but in that case, there were no multiple cores (darn). I spawned multiple rsync commands within Ruby and tracked them with the Ruby script. It was mainly the network that was the bottleneck there, but it did speed things up some – with multiple cores and a faster CPU, who knows?

Also, in these days, most every scripting language has threads (which could potentially be run on multiple CPUs). I’ll see if I can’t put something together about threading in Ruby or Perl one of these days.

Entry Filed under: Perl, Ruby, Scripting, Tips, UNIX. Tags: , , , , , , .

Leave a Comment

Required

Required, hidden

Some HTML allowed:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <pre> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Trackback this post  |  Subscribe to the comments via RSS Feed


David Douthitt

David is an experienced UNIX and Linux system administrator, a former Linux distribution maintainer, and author of two books ("Advanced Topics in System Administration" and "GNU Screen: A Comprehensive Manual"). View David Douthitt's profile on LinkedIn Support freedom The Internet Traffic Report monitors the flow of data around the world. It then displays a value between zero and 100. Higher values indicate faster and more reliable connections.

Recent Posts

Top Posts

RSS Sharky’s Column!

Calendar

February 2008
M T W T F S S
« Jan   Mar »
 123
45678910
11121314151617
18192021222324
2526272829  

Recent Comments

Anthony on About
MikeT on Stress Relief: Laugh Out Loud…
yungchin on Sparse files – what, why…
Randal L. Schwartz on Perl Tidbits: Annoyances and…
Court on Perl Tidbits: Annoyances and…

Category Cloud

BSD Career Conferences Debian Debugging Disaster recovery Fedora FreeBSD HP-UX Legal Linux MacOS X Mobile Computing Networking OpenBSD OpenSolaris OpenVMS Personal Notes Portable Code Presentations Productivity Programming Red Hat Scripting Security Solaris Storage Tips Ubuntu UNIX

Archives

Feeds

Blogroll

Pages

Meta