BarCamp Chicago 2008: Afterword

BarCamp Chicago wrapped up nicely yesterday with a number of talks: one on Python (I still don’t get why folks aren’t using Ruby, but that’s just me), an open source hardware project demo, a talk on wikis, and a talk on CouchDB – very nice indeed.

The open source hardware project is called Arduino; boards are available prebuilt for a minimal price (about US$30 to US$40), though you could build one yourself if you like (the schematics are online and available to all). In the demo, an accelerometer was attached to the Arduino (which was connected to the computer via USB) and its readings were printed on the console.
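If you want to try the same thing at home, reading the board’s serial output from a shell is straightforward. A minimal sketch – the device node (/dev/ttyUSB0) and the 9600 baud rate are assumptions; adjust them to your setup:

# Match the serial line to the Arduino sketch's baud rate,
# then stream the readings as they arrive
stty -F /dev/ttyUSB0 9600 raw
cat /dev/ttyUSB0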

The wiki talk covered what it took to install a wiki and the speaker’s experience with wikis (and MediaWiki in particular).

The CouchDB talk discussed CouchDB (which was particularly pertinent, because it is written in Erlang, discussed earlier). CouchDB is a document-oriented database: it stores JSON documents and is accessed over a RESTful HTTP API, and it can be spread out among a set of computers quite easily via replication. Note that it is not relational, and it is not object-oriented either.
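Because everything goes over HTTP, you can poke at CouchDB from the shell with nothing but curl. A quick sketch, assuming a default installation listening on localhost port 5984 (the database name and document contents here are made up):

# Create a database, store a document in it, then ask about it
curl -X PUT http://localhost:5984/testdb
curl -X POST http://localhost:5984/testdb \
     -H 'Content-Type: application/json' \
     -d '{"talk": "couchdb", "venue": "BarCamp Chicago"}'
curl http://localhost:5984/testdb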

And of course, what is BarCamp Chicago without Ron May?

Using Parallel Processing for Text File Processing (and Shell Scripts)

Over at his blog, Onkar Joshi wrote an article about how to write a shell script that processes a text file in parallel. He provided an example script (reproduced here with my commentary in the comments):

# Split the file up into parts
# of 15,000 lines each
split -l 15000 originalFile.txt
#
# Process each part in a separate process -
# the processes will run on separate
# processors or CPU cores
for f in x*
do
    runDataProcessor "$f" > "$f.out" &
done
#
# Now wait for all of the child processes
# to complete
wait
#
# All processing completed:
# stitch the text file back together
# (the shell sorts the glob, so the parts
# come back in their original order)
for k in x*.out
do
    cat "$k" >> combinedResult.txt
done

The commentary should be fairly complete. The main trick is to split the file into sections that can each be processed independently of the others. Not all files can be processed this way, but many can.

When the file is split up this way, multiple processes can be started to handle the parts – the operating system will schedule those processes onto separate processors or cores, and the whole job will run faster. Otherwise, a single process would utilize only one core, with no benefit from the others.
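Rather than hard-coding 15,000 lines, the chunk size could be derived from the core count, so each core gets roughly one chunk. A minimal sketch, assuming a glibc-based Linux where getconf reports _NPROCESSORS_ONLN:

# One chunk per core: compute the chunk size from the line count
cores=$(getconf _NPROCESSORS_ONLN)
total=$(wc -l < originalFile.txt)
split -l $(( (total + cores - 1) / cores )) originalFile.txt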

The split command is likely already on your system: HP-UX has it, and Linux (via GNU coreutils) has it as well.

This pattern could be used for many more things: split the task into parts (separate processes), wait for them to finish, and combine the results back together as necessary. Remember that each process gets a separate processor only if the operating system supports multiple processors; a Linux kernel built without SMP support, for example, will see no speedup.
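To find out whether the kernel actually sees more than one processor, a couple of quick checks. These are Linux-specific (the /proc path does not exist on other systems):

grep -c '^processor' /proc/cpuinfo   # how many CPUs the kernel sees
uname -v                             # an SMP kernel usually says "SMP" here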

I did something like this in a Ruby script I wrote to rsync some massive files from a remote host – though in that case there were no multiple cores to take advantage of (darn). I spawned multiple rsync commands from within Ruby and tracked them in the script. The network was the main bottleneck there, but it did speed things up some – with multiple cores and a faster CPU, who knows?
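The shell version of that idea looks much like the script above. A rough sketch (the host, paths, and file names are hypothetical): start several transfers at once, then wait for all of them.

# Run the transfers concurrently rather than one after another
for f in big1.dat big2.dat big3.dat
do
    rsync -av "remotehost:/data/$f" /local/data/ &
done
wait    # all transfers finished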

Also, these days almost every scripting language has threads (though whether they can actually run on multiple CPUs depends on the implementation). I’ll see if I can’t put something together about threading in Ruby or Perl one of these days.

UPDATE: Fixed links to article and blog (thanks for the tip, Onkar!).

What to do when the system libraries go away…

You’ve been hacking away at this system (let’s be positive and upbeat and say it’s a test system and not production). Through a slip of the fingers, you move the system libraries out of the way – all of them. Now nothing can find the libraries. Now what? Is everything lost?

Don’t despair! You can do a lot without libraries. Software that is already running has its libraries mapped into memory, so it keeps working – and that includes your current shell.

There may be some statically compiled binaries on the system that don’t require shared libraries; these can still be run. If a scripting language like perl or ruby is statically compiled, then all is well – these languages can do almost anything, and can stand in (temporarily) for binaries such as mv, cp, and others. However, since vi is probably not statically linked, you may have to do your repairs at the command line rather than in an editor.
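As a sketch of what that looks like, here are some hypothetical stand-ins built on a statically linked perl (its core modules are plain files on disk, not shared libraries, so they survive the mishap):

perl -e 'rename $ARGV[0], $ARGV[1] or die $!' oldname newname   # mv
perl -MFile::Copy -e 'copy(shift, shift) or die $!' src dst     # cp
perl -e 'unlink @ARGV or die $!' unwantedfile                   # rm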

Here are some things one can do:

echo *

Through the shell’s filename expansion, this works out to a rough imitation of ls – all the names on one line, much as ls -m would print them. If you have to empty a file (truncate its contents to nothing), use this command:

> file

Nearly every standard utility today is dynamically linked; this means that in situations like these you are stuck with only what the shell itself provides. Remember that things like cat, ls, mv, cp, vi, rm, ln, and so on are all separate system executables – and quite possibly dynamically linked.
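The shell’s built-ins go further than you might expect, though. A few sketches using bash built-ins only, with no external binaries involved:

# A poor man's cat, using only the read and printf built-ins
while IFS= read -r line; do printf '%s\n' "$line"; done < somefile
# ls, one name per line
printf '%s\n' *
# Append a line to a file
echo "appended text" >> somefile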

The best thing for a situation like this is to have prepared in advance: have a copy of busybox handy, and possibly a statically compiled perl or ruby (or both). Don’t forget editors – keep a copy of e3 or some other statically compiled editor. Busybox provides all the standard utilities in a single statically linked binary, and e3 is a tiny (i386-specific) editor that emulates vi, pico, WordStar, or emacs depending on the name it is invoked under. Neither busybox nor e3 requires additional libraries.
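For instance, if a static busybox was stashed away beforehand, the missing utilities come back immediately. A sketch – the paths here are assumptions (wherever you put busybox, and wherever the libraries ended up):

/root/busybox ls /lib.bak                # look at the displaced libraries
/root/busybox cp -a /lib.bak/. /lib/     # put them back where they belong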

Another good preparation (and a useful tool in case of a security breach) is a small CDROM of tools, all statically linked for your environment. Such a disk requires no libraries at all – and can hold all of these necessary tools and more.
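Getting at such a disk is itself library-free if busybox is on hand, since busybox includes a mount applet. A sketch (the device name and the disk’s layout are assumptions):

/root/busybox mount -t iso9660 /dev/cdrom /mnt   # mount the rescue CD
/mnt/bin/ls /                                    # static tools, ready to use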

Of course, the best thing is to avoid doing this kind of thing in the first place…