Personal Backups: A Lesson in Computing Safety

Over at the Daring Fireball blog, John Gruber has a nice article about how his extensive backup saved him recently from losing data on a hard drive that died. I know the utilities he speaks of (SuperDuper and DiskWarrior) and can vouch for their usefulness on Macintoshes (although I prefer to use psync instead of SuperDuper).

It was Merlin Mann over at 43Folders who noticed the article and then wrote his own take on it.

For Linux and Unix, backups are much more varied. Two of the most widely known programs are Bacula and Amanda, both enterprise-level backup tools. For personal use, I prefer to use rsync to make copies of my home directories. There are a large number of tools that use rsync to make backups; one tool is from Mike Rubel from way back in 2004, with a comprehensive article titled Easy Automated Snapshot-Style Backups with Linux and Rsync. Another good article is from Joe Brockmeier on Linux.com titled Back up like an expert with rsync.
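
As a minimal sketch of that approach (the paths and the excluded directory are only examples), a nightly home-directory backup to a second disk can be as small as one rsync command run from cron:

rsync -avH --delete --exclude='.cache/' /home/myuser/ /mnt/backupdisk/myuser/

The snapshot-style tools mentioned above build rotating, dated copies on top of this same basic command.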

One popular (and simple) backup program is Dirvish, which I believe uses rsync behind the scenes. I’ve used the KDE app keep before, which was an easy and pleasant experience. The program rdiff-backup is also commonly recommended.

Whatever you use, the most important thing is: do it! The easier and more automated the better: if you dread making backups, you won’t do it.

Also, don’t settle for just one backup: what happens if you need to retrieve a backup from a while back – or your primary backup system fails? Best is multiple backups with multiple methods: back up to another disk and to the Internet (using services like SpiderOak or Box.net or even Ubuntu One).

Lastly, there are a couple of Java-based backup programs – specifically, Areca and plan/b. Cory Buford wrote about these programs for Linux.com in 2008, but it is hard to see how Java-based programs can reliably read all filesystem attributes and restore them without problems. Java, after all, is available from HP for the OpenVMS platform; do these programs restore all ODS-5 attributes? Do they work with other Linux filesystems like XFS and JFS and ext4? Perhaps someone can fill us in…

Whatever the case is – whatever the tool – go back up your personal systems now! I know I will be backing up. Don’t wait until your data is lost.

5 Programs That Should be in the Base Install

There are a number of programs that never seem to be installed with the base system, but should be. In this day and age of click-to-install, these programs often require an additional installation – I maintain that this should not be the case.

Most of these are relevant to Linux, but the programs are often missing on commercial UNIXes as well.

  • Ruby. This is the first that comes to mind. I have been installing ruby onto systems since version 1.4.6 – and ruby is still a fantastic scripting language, and one of the best implementations of object-oriented programming since Smalltalk.
  • m4. I recently wrote about m4, and thought it was already installed on my Ubuntu Karmic system – not so. I used it to create a template for the APT sources.list file.
  • ssh. This should be installed everywhere automatically, and not as an add-on. For many UNIX systems, ssh is an add-on product that must be selected or compiled from source.
  • rsync. Rsync is a fabulous way to copy files across the network while minimizing traffic – even though it was designed for efficiency over slow links rather than for raw speed.
  • ksh. This will surprise most commercial UNIX administrators. However, Linux does not come with ksh installed – and the emulation by GNU bash is weak. Now you can install either AT&T ksh-93 (the newest version!) or the old standby, pdksh (which is close to ksh-88); a sample install command for all five of these follows this list.
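
As a sketch for a Debian-style system such as Ubuntu (package names vary between distributions, so treat these as examples), all five can be pulled in at once:

sudo apt-get install ruby m4 openssh-server rsync ksh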

OpenVMS is a different animal – some of the things that should be installed by default there include perl, ruby, java, SMH, and ssh. I’m not sure whether perl or ssh is installed by default, but they should be. OpenVMS should also ship with compliant NFS v3 and v4 support out of the box – without making it difficult to connect to other NFS servers.

What programs do you think should be in the base install?

Using make and rsync for Data Replication

When maintaining a cluster environment (such as HP Serviceguard), there are often directories and configurations which need to be maintained on two different local disks (on different machines). Using make and rsync (with ssh) is an excellent way to do this.

The rsync command allows you to replicate the local data onto the remote side, copying only what is necessary. This is not necessarily the fastest approach, but it is the most efficient: rsync was designed for efficiency over slow links, not speed over high-speed links. Set RSYNC_RSH in the Makefile so that rsync uses ssh automatically (the variable has to reach rsync’s environment, as shown below), then use rsync to copy the files over:

RSYNC_RSH=/usr/bin/ssh -i /path/to/mykey

RSYNC_RSH='$(RSYNC_RSH)' rsync -av $(LOCAL_FILES) remoteserver:$(PWD)

To automate this properly, an ssh key will have to be created using ssh-keygen and transferred to the other host. The private key (/path/to/mykey in this example) is used by ssh in the background during rsync processing; with the key in place, no interactive login is necessary.
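
As a sketch (the key path and hostnames are examples): generate the key pair without a passphrase, then install the public half on the remote host. The ssh-copy-id helper ships with OpenSSH on many systems; appending the public key to ~/.ssh/authorized_keys by hand works just as well.

ssh-keygen -t rsa -N "" -f /path/to/mykey
ssh-copy-id -i /path/to/mykey.pub user@remoteserver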

For best results, create an “all” target (at the top of the file) that explains the usable targets, and create a “copy” target that does the relevant rsync.

I recommend copying only relevant files, not the entire directory: this way, some files can be retained only on one node – this is good for log files and for temporary files.

For example:

LOCAL_FILES=*.pkg *.ctl *.m4
RSYNC_RSH=/usr/bin/ssh -i /path/to/mykey

all:
    echo "To copy files, use the copy target..."

# pass RSYNC_RSH into the environment so that rsync uses ssh with the key
copy:
    RSYNC_RSH='$(RSYNC_RSH)' rsync -av $(LOCAL_FILES) remoteserver:$(PWD)

Make sure to verify the setup before you use it in normal operation: the rsync option -n performs a dry run that changes nothing. Also make sure that you don’t update the same files on both hosts; if you do, the next copy may silently overwrite one set of changes (interesting, and unfortunate…)

After performing the update, the Makefile can trigger a reconfiguration or a reload of the daemons to put the configuration in place.
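
As a sketch (the daemon name and its pid file below are placeholders, not anything Serviceguard-specific), a reload target can depend on the copy target and then signal the daemon on the remote node over the same ssh key:

reload: copy
    /usr/bin/ssh -i /path/to/mykey remoteserver 'kill -HUP `cat /var/run/mydaemon.pid`'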

Sparse files – what, why, and how

Sparse files are basically just like any other file except that blocks that only contain zeros (i.e., nothing) are not actually stored on disk. This means you can have an apparently 16G file – with 16G of “data” – only taking up 1G of space on disk.

This can be particularly useful in situations where the full disk may never be completely used. One such situation would be virtual machines. If a virtual machine never fills the disk entirely, then a certain amount of the disk will never have anything but zeros in it – permitting the saving of disk space by using a sparse file.

An operating system that supports sparse files knows that the block “exists” but holds nothing, so it provides the zero-filled block out of thin air. As soon as the block contains data, the data is written to disk in the appropriate way and the file on disk grows.

There are problems with using sparse files. The most egregious is that any utility that does not recognize sparse files will replicate the entire file on disk, so that a sparse file taking up 500M on disk could suddenly balloon to its full 6G apparent size. Even utilities that can work with sparse files usually must be told to do so, so this trap is easy to fall into.

Another problem is that much of disk management is based on how many sectors or blocks are used – yet the size reported for a sparse file is the full (apparent) size of the file, not the actual number of blocks on disk. This also means that utilities that work solely with blocks on disk (e.g., du) will report different amounts than other utilities.

Yet another problem is whether backup programs recognize and preserve sparse files. A backup made by a program that does not recognize (and store) sparse files may well be far larger than expected. A restore of this oversized backup – or indeed, a restore that does not recognize properly backed up sparse files – will balloon in size as mentioned before. In the worst case, there is not enough room on disk to restore the data, because the expanded sparse file fills the disk.

To get the ls command to report actual on-disk sizes (instead of just the file size typically reported) use these options:

# ls -ls
total 96
8 -rw-r--r-- 1 root root 2003 Aug 14 2006 anaconda-ks.cfg
68 -rw-r--r-- 1 root root 59663 Aug 14 2006 install.log
8 -rw-r--r-- 1 root root 3317 Aug 14 2006 install.log.syslog
12 -rw------- 1 root root 10164 Oct 25 2006 mbox

The blocks used are in the leftmost column; the bytes used are in their usual column just before the date. In this case, all of the files are using an appropriate number of blocks – that is, none of these files are sparse.

Creating a sparse file basically amounts to a simple process: create a file, seek out to the desired end, and write a byte there. This can be done at the command line with a command like the following (to create a 1M sparse file):

dd if=/dev/zero of=sparse-file bs=1 count=1 seek=1024k

Here is an example, run on HP-UX:

# dd if=/dev/zero of=sparse bs=1 count=1 seek=1024k
1+0 records in
1+0 records out
# ls -ls sparse*
2 -rw-r--r-- 1 root sys 1048577 May 22 12:58 sparse
#

Looking at HP-UX (and perhaps other environments), there does not appear to be much support for sparse files in the typical utilities (such as tar, cp, cpio, etc.), even though the operating system itself will dutifully create sparse files.

However, GNU cp supports the --sparse option. By default, GNU cp attempts to detect sparse files and recreates them as warranted. A file can be copied into a sparse format using --sparse=always, or into a nonsparse format using --sparse=never. The default, alluded to earlier, is --sparse=auto.

GNU tar uses the --sparse option (or the equivalent -S option) to make tar store sparse files appropriately.

GNU cpio supports the --sparse option, which operates similarly: any suitable run of zeros is recorded as “empty” and the file is stored or created as a sparse file.

The command rsync, while not a GNU project, does support sparse files. The option --sparse (or -S) will attempt to handle sparse files efficiently, that is, not creating file blocks full of only zeros.
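
As a quick sketch of the options above (the file names are placeholders), copying, archiving, and transferring a disk image while preserving its holes might look like this:

cp --sparse=always disk.img /backup/disk.img
tar -cSf images.tar disk.img
rsync -avS disk.img remoteserver:/backup/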

It appears that utilities pax, scp, sftp, and ftp do not in general support sparse files. Using such a utility, then, would make a sparse file balloon in size.

Given this utility support for sparse files, the best native environment is probably Linux or Mac OS X, or another open platform – more specifically, any platform where the GNU utilities are available, along with rsync.

Recognizing a sparse file can be done at the command line with the previously mentioned ls -ls command. To map out a sparse file right down to the block level, one approach would be to write a small C program (or use another language) that starts at the end of the file and truncates it one block at a time, watching whether the number of allocated blocks drops: if it drops, that block was on disk; if not, it was a hole. Of course, any block that contains data is on disk; it is only blocks full of zeros where there is any question. Jonathan Corbet had an article in December of 2007 in the Linux Weekly News that described a method the Solaris ZFS developers had proposed, and posed the question of whether Linux should support similar system calls that seek to the next hole or the next run of data in a file.
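
A cruder check that avoids writing any code is to compare the apparent size of a file with the space actually allocated for it; this sketch assumes the GNU version of stat and uses a placeholder file name:

f=disk.img
apparent=$(stat -c %s "$f")
allocated=$(( $(stat -c %b "$f") * $(stat -c %B "$f") ))
[ "$allocated" -lt "$apparent" ] && echo "$f looks sparse"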

Securing your network traffic

If you want to start some exciting discussion in a security forum, just say you use telnet: you’ll find that every admin knows that telnet is insecure, that one should use OpenSSH or similar to encrypt the traffic, and that telnet should be banned from the server environment entirely.

However, telnet is not the only server that transmits its passwords in the clear. There are a lot of others. Here’s a list I came up with:

  • FTP
  • HTTP
  • IMAP
  • IPP
  • LDAP
  • LPD
  • NFS
  • POP3
  • rsync
  • SMTP
  • SNMP
  • syslog
  • VNC
  • X11
  • XDMCP

I won’t cover all of these here (more about these items can be found in my book) but I do want to cover just a few.

Consider, for example, the mail protocols: SMTP, POP3, and IMAP. SSL encryption is available for all three – but do you use it? And what about your logins to your mailbox at your ISP? Every time you log in, your mailbox password goes across the wire in the clear.
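
An easy way to see what your mail server actually offers is to connect with openssl s_client (the hostname here is only an example):

openssl s_client -connect mail.example.com:993                 # IMAP over SSL
openssl s_client -connect mail.example.com:995                 # POP3 over SSL
openssl s_client -starttls smtp -connect mail.example.com:25   # SMTP with STARTTLS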

What about NFS – particularly NFS home directories? If you have unencrypted secrets in your home directory, these items will be transmitted across the network in the clear as well. What about private SSH keys? Unfortunately, there is no simple way to encrypt traditional NFS traffic (NFSv4 with Kerberos privacy, or an IPsec tunnel, can do it, but neither is trivial to set up).

VNC is another one to watch for: if you type passwords for your root logins over VNC – even if you are using SSH in your VNC session – the passwords are in the clear. The only way to secure VNC entirely is to use an SSH tunnel to encrypt it.
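
A sketch of such a tunnel (the hostname and display number are examples): forward a local port to the VNC port on the remote host, then point the viewer at the local end of the tunnel.

ssh -f -N -L 5901:localhost:5901 user@remotehost
vncviewer localhost:1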

X11 is insecure in the same way, but presents special problems. However, OpenSSH handles X transparently through the use of special tunnels just for X.
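
For example, assuming X11Forwarding is enabled on the remote sshd, a single option is enough:

ssh -X user@remotehost xterm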

syslog is another unencrypted service; do passwords end up in your system logs? What about the secret doings of your servers? How much information leakage can you handle? Unfortunately, syslog is another service that cannot easily be secured unless you use something such as syslog-ng, which permits you to use TCP (and thus an OpenSSH tunnel).
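
As a rough sketch (the source name s_sys, the port numbers, and the loghost are examples, and the loghost needs a matching TCP source defined), the client can log to a local port that an ssh tunnel carries to the loghost:

# client-side syslog-ng.conf: send the default source to a local TCP port
destination d_tunnel { tcp("127.0.0.1" port(5140)); };
log { source(s_sys); destination(d_tunnel); };

# the tunnel that carries it to the loghost's TCP syslog port
ssh -f -N -L 5140:127.0.0.1:514 user@loghost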

Using Parallel Processing for Text File Processing (and Shell Scripts)

Over at his blog, Onkar Joshi wrote an article about how to write a shell script that processes a text file using parallel processing. He provided an example script (reproduced here with my commentary in the comments):

# Split the file up into parts
# of 15,000 lines each
split -l 15000 originalFile.txt
#
# Process each part in separate processes -
# which will run on separate processors
# or CPU cores
for f in x*
do
runDataProcessor $f > $f.out &
done
#
# Now wait for all of the child processes
# to complete
wait
#
# All processing completed:
# build the text file back together
for k in *.out
do
cat $k >> combinedResult.txt
done

The commentary should be fairly complete. The main trick is to split the file into independent sections to be processed, then to process the sections independently. Not all files can be processed this way, but many can.

When the file is split up in this way, multiple processes can be started to process the parts – and the kernel can schedule those separate processes onto separate processors or cores, so the entire job runs faster. Otherwise, with a single process, the job would utilize only one core, with no benefit from the others.

The command split may already be on your system; HP-UX has it, and Linux has everything.

This combination could potentially be used for more things: splitting a task into parts (separate processes), waiting for the results, and combining things back together as necessary. Remember that each process may get a separate processor only if the operating system supports multiple processors; a Linux kernel built without SMP support will not help here, for example.
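
On Linux, a quick way to see how many processors the kernel has available (and therefore how many pieces are worth splitting the file into) is:

grep -c '^processor' /proc/cpuinfo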

I did something like this in a Ruby script I wrote to rsync some massive files from a remote host – but in that case, there were no multiple cores (darn). I spawned multiple rsync commands within Ruby and tracked them with the Ruby script. It was mainly the network that was the bottleneck there, but it did speed things up some – with multiple cores and a faster CPU, who knows?

Also, in these days, most every scripting language has threads (which could potentially be run on multiple CPUs). I’ll see if I can’t put something together about threading in Ruby or Perl one of these days.

UPDATE: Fixed links to the article and blog (thanks for the tip, Onkar!).