Backups: What You’d Rather Not Know

Some time ago, Elizabeth Zwicky wrote an article for LISA V (1991) titled Torture-testing Backup and Archive Programs, and followed it up at LISA 2003 with Further Torture: More Testing of Backup and Archive Programs. The articles describe extensive tests of backup clients and archive programs, and find that all of them fall short in one way or another, though the programs improved significantly over time.

These articles are real eye-openers; they show why a restore test is a critical part of any backup solution. Without testing a restore, there is no guarantee that an actual restore will be successful.

There are lots of stories about otherwise brilliant backup solutions that failed when a restore was necessary. My favorite was of a fellow who took the magnetic tape backups home as an offsite measure – except that he kept a massive magnet in the passenger seat of his car. The offsite backups were great – except he erased them (unknowingly) every time he took them home… Guess what happened when the offsite backups were needed during a critical restore?

To create a successful backup strategy, you must first choose how to make the backups:

  • Gauge how critical the resource is. Do the backups need to be restored in minutes? Or is a restore in hours suitable?
  • What kinds of backups will be taken? Full backups nightly? Incremental?
  • Gauge the time and space available to take backups. Will the backup put a strain on the network? Is there enough space?
  • Choose a program or programs to fulfill your needs and install them.

After the infrastructure is in place, a successful backup strategy requires that you:

  • Perform a test backup, and measure the time and space taken.
  • Perform a test restore (of a portion of the backup). How easy is it? Is it easy to use under pressure? Was it an accurate restore?
  • Do a bare metal restore. How long did it take? Is it accurate?
  • Perform a restore test from time to time to make sure that backups are good: once is not enough.
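
The test backup and test restore steps in this list can be sketched in shell. This is only a minimal sketch: tar stands in for whatever backup program you actually chose, and the function names test_backup and test_restore are made up for illustration. It also assumes a date command that understands +%s, which is common but not guaranteed on every Unix.

```shell
#!/bin/sh
# Sketch only: tar is a stand-in for your real backup program.

# Back up directory $1 into archive $2, then report elapsed time and size.
# Keep the archive outside the directory being backed up.
test_backup() {
    src=$1
    archive=$2
    start=$(date +%s)
    tar -cf "$archive" -C "$src" . || return 1
    end=$(date +%s)
    echo "backup took $((end - start)) seconds"
    echo "archive size: $(wc -c < "$archive") bytes"
}

# Restore archive $1 into scratch directory $2, then compare the result
# with the original directory $3. Returns non-zero if anything differs.
test_restore() {
    archive=$1
    scratch=$2
    original=$3
    mkdir -p "$scratch" || return 1
    tar -xf "$archive" -C "$scratch" || return 1
    # diff -r compares names and contents only; it does not check
    # ownership, permissions, ACLs, or other metadata.
    diff -r "$original" "$scratch"
}
```

Something like test_backup /home/me /backup/home.tar followed by test_restore /backup/home.tar /tmp/restore-check /home/me would cover the first two steps (the paths are examples). Because diff -r looks only at file contents, a clean comparison here does not prove that permissions or extended attributes survived, which is exactly the kind of gap the Zwicky articles probe.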

Only through diligent testing of both backup and restore can you be sure that everything is working properly and that your data is safe.

Personal Backups: A Lesson in Computing Safety

Over at the Daring Fireball blog, John Gruber has a nice article about how his extensive backup saved him recently from losing data on a hard drive that died. I know the utilities he speaks of (SuperDuper and DiskWarrior) and can vouch for their usefulness on Macintoshes (although I prefer to use psync instead of SuperDuper).

It was Merlin Mann over at 43Folders who noticed the article and then wrote his own take on John’s article.

For Linux and Unix, backups are much more varied. Two of the most widely known programs are Bacula and Amanda, both enterprise-level backup tools. For personal use, I prefer to use rsync to make copies of my home directories. There are a large number of tools that use rsync to make backups; one tool is from Mike Rubel from way back in 2004, with a comprehensive article titled Easy Automated Snapshot-Style Backups with Linux and Rsync. Another good article, from Joe Brockmeier, is titled Back up like an expert with rsync.
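
As a rough illustration of the snapshot-style idea, here is a minimal sketch built on rsync's --link-dest option, which hard-links unchanged files to a previous snapshot so that each snapshot looks complete but only changed files take new space. It is not claimed to be the exact method from either article. The function name snapshot_backup, the "latest" symlink, and the timestamp layout are all assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: one dated directory per run, with unchanged files
# hard-linked to the previous snapshot.

# Create a dated snapshot of directory $1 under backup root $2.
snapshot_backup() {
    src=$1
    mkdir -p "$2" || return 1
    # --link-dest needs an absolute path (a relative one is taken
    # relative to the destination directory).
    dest=$(cd "$2" && pwd) || return 1
    stamp=$(date +%Y-%m-%d_%H%M%S)

    if [ -d "$dest/latest" ]; then
        # Unchanged files become hard links into the previous snapshot.
        rsync -a --link-dest="$dest/latest" "$src/" "$dest/$stamp/" || return 1
    else
        # First run: nothing to link against, so copy everything.
        rsync -a "$src/" "$dest/$stamp/" || return 1
    fi

    # Repoint "latest" at the snapshot just taken.
    rm -f "$dest/latest"
    ln -s "$stamp" "$dest/latest"
}
```

Run from cron as something like snapshot_backup "$HOME" /mnt/backupdisk/home (an example path), this leaves one directory per run. Old snapshots can simply be removed with rm -rf, since a hard-linked file's data stays on disk until its last link is gone.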

One popular (and simple) backup program is Dirvish, which I believe uses rsync behind the scenes. I’ve used the KDE app Keep before, which was an easy and pleasant experience. The program rdiff-backup is also commonly recommended.

Whatever you use, the most important thing is: do it! The easier and more automated the better: if you dread making backups, you won’t do it.

Also, don’t settle for just one backup: what happens if you need to retrieve a backup from a while back – or your primary backup system fails? Best is multiple backups with multiple methods: back up to another disk and to the Internet (using services like SpiderOak or even Ubuntu One).

Lastly, there are a couple of Java-based backup programs – specifically, Areca and plan/b. Cory Buford wrote about these programs in 2008. It is hard to see how Java-based programs can reliably read all filesystem attributes and restore them without problems. Java, after all, is available from HP for the OpenVMS platform; do these programs restore all ODS-5 attributes? Do these programs work with other Linux filesystems like XFS and JFS and ext4? Perhaps someone can fill us in…

Whatever the case is – whatever the tool – go back up your personal systems now! I know I will be backing up. Don’t wait until your data is lost.

JournalSpace Dies by Data Loss

The blogging site JournalSpace has been shut down after it suffered significant data loss and had no backups. The entrepreneur’s blog has more information – the most likely cause appears to be sabotage by a former IT staff person, combined with the lack of working backups.

What can we learn from this unfortunate incident? There are a number of things to note here:

  • Remove all access for former staff in its entirety – don’t skimp! All access, passwords, server access, everything. Lock it down. If you have only one IT staffer, you are also at risk: you need to be able to call on someone who can lock out your fired (or laid off) employee completely.
  • Disk RAID is not a backup solution. RAID protects you from disk failure, not from “data failure” or operator mistakes. Do not forget to have a complete backup solution in place. It also pays to enable a “hot spare” so that if one of the disks fails, there is still protection from disk loss.
  • Have a backup solution. You must have a comprehensive backup plan working, tested, and implemented.
  • Have a working backup solution. This point cannot be stressed enough: Test your backup solution before you need to use it! When the data is gone is no time to realize that the backups are useless. Test your backups in real-world scenarios as well: one story described a backup solution that was well-tested, then the tapes went off-site in the operator’s car. Unfortunately, sitting in the car caused the tapes to be demagnetized and this was realized only after the data was gone. Test those backups!

The dreadful story of JournalSpace might have had a different ending if they had only tested their backups: that alone would have saved them. However, solutions (like security) should be in depth: working backups might not be enough next time.

VxFS (or HP Online JFS) Snapshots

A disk snapshot is a snap in time, a picture of what a disk looked like “back then”. This can be very useful for maintenance.

For example, being able to freeze a Caché instance, take a disk snapshot, then thaw the Caché instance will permit you to take backups or copies of a Caché database with minimal downtime.

For HP-UX Online JFS and Veritas VxFS the commands are the same (since these are actually the same product – or close to it). To actually do a snapshot:

mount -F vxfs -o snapof=/var/cache/db /dev/snap01 /snap

The first filesystem presented in the command line (the value of the snapof= option) is key: it is the source of the snapshot. Note that it can be either a device or a current mount point. The second (the device) is a filesystem prepared to hold a snapshot, and the last is the usual mount point.

Once this is done, the normal filesystem can continue to be used while the snapshot retains the data as it was when the snapshot was taken. In the example above, /var/cache/db could be used normally while the snapshot resides on /snap. If there was a directory /var/cache/db/db01 then there would also be a /snap/db01 available.

One caveat is that as long as the snapshot is mounted and in use, the changes to the original filesystem are being saved – it is possible that the snapshot volume can run out of space. When this happens, you will receive what may appear to be mysterious disk full errors unless you realize what is happening. So don’t keep your snapshots around forever.
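
Putting the pieces together, the whole cycle (freeze, snapshot, thaw, copy, release the snapshot) can be sketched as a small shell wrapper. This is only a sketch under stated assumptions: the function name snapshot_backup_cycle is made up, and the freeze, thaw, and copy commands are passed in as strings because they depend on your database and backup tool. Nothing here is specific to Caché or VxFS beyond the order of operations.

```shell
#!/bin/sh
# Sketch only: every argument is a command string supplied by the caller.
#   $1 freeze the database       $2 mount the snapshot
#   $3 thaw the database         $4 copy data off the snapshot
#   $5 unmount the snapshot
snapshot_backup_cycle() {
    freeze_cmd=$1
    mount_cmd=$2
    thaw_cmd=$3
    copy_cmd=$4
    umount_cmd=$5

    sh -c "$freeze_cmd" || { echo "freeze failed" >&2; return 1; }

    mounted=0
    sh -c "$mount_cmd" && mounted=1

    # Thaw right after the snapshot attempt, whether or not it worked,
    # so the database is frozen only briefly.
    status=0
    sh -c "$thaw_cmd" || { echo "thaw failed" >&2; status=1; }

    if [ "$mounted" -eq 1 ]; then
        if [ "$status" -eq 0 ]; then
            sh -c "$copy_cmd" || status=$?
        fi
        # Release the snapshot even if the copy failed, so it does not
        # sit around filling the snapshot volume.
        sh -c "$umount_cmd" || status=$?
    else
        echo "snapshot mount failed" >&2
        status=1
    fi
    return "$status"
}
```

With the example from this article, the mount command would be "mount -F vxfs -o snapof=/var/cache/db /dev/snap01 /snap" and the unmount command "umount /snap"; the freeze, thaw, and copy commands are whatever your site uses. The point of the wrapper is the ordering: the database is thawed as soon as the snapshot exists, and the snapshot is always unmounted at the end.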