JournalSpace Dies by Data Loss

The blogging site JournalSpace has been shut down after a significant data loss without backups. The entrepreneur’s blog has more information: the most likely cause appears to be sabotage by a former IT staff member, combined with the lack of working backups.

What can we learn from this unfortunate incident? There are a number of things to note here:

  • Remove all access for former staff in its entirety – don’t skimp! All access, passwords, server access, everything. Lock it down. If you have only one IT staffer, you are also at risk: you need to be able to call on someone who can lock out your fired (or laid off) employee completely.
  • Disk RAID is not a backup solution. RAID protects you from disk failure, not from “data failure” or operator mistakes. Do not forget to have a complete backup solution in place. It also pays to enable a “hot spare” so that if one of the disks fails, there is still protection from disk loss.
  • Have a backup solution. You must have a comprehensive backup plan working, tested, and implemented.
  • Have a working backup solution. This point cannot be stressed enough: test your backup solution before you need to use it! Once the data is gone, it is too late to discover that the backups are useless. Test your backups in real-world scenarios as well: one story described a backup solution that was well-tested, then the tapes went off-site in the operator’s car. Unfortunately, sitting in the car caused the tapes to be demagnetized, and this was discovered only after the data was gone. Test those backups!
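One way to test a backup is to actually restore it somewhere safe and compare the result with the live data. The following is a minimal Python sketch of that idea, assuming the backup is a tar archive whose member names are relative to the source directory; the function names and the archive layout are illustrative assumptions, not a description of any particular backup product.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_backup(source_dir: Path, archive: Path) -> List[str]:
    """Restore the archive into a scratch directory and return the
    source files that are missing from it or differ from it."""
    expected = checksums(source_dir)
    with tempfile.TemporaryDirectory() as scratch:
        # Only restore archives you created yourself; extracting an
        # untrusted archive can write outside the scratch directory.
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        restored = checksums(Path(scratch))
    return sorted(
        name for name in expected if restored.get(name) != expected[name]
    )
```

An empty list means every source file was restored intact. Files changed since the backup was taken will also be reported, so a check like this is most meaningful right after a backup run, or against a snapshot of the data taken at backup time.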

The dreadful story of JournalSpace might have had a different ending if they had only tested their backups: that alone would have saved them. However, protection, like security, should be layered in depth: working backups alone might not be enough next time.

Disaster recovery planning

Planning for a disaster is not necessarily as easy as it sounds. It helps if you have a rampant imagination. Throughout disaster planning, the dominant question is What if…? Following the planning, testing is required: the best plans are worthless if they don’t work in practice.

Consider an Internet server serving web pages. Let’s assume that downtime is not an option: this is a typical starting point. The best approach is to start with what is most specific to the system and work outward to the complete environment:

  • What if… a disk goes bad?
  • What if… the software stops?
  • What if… memory runs out?
  • What if… the power goes out?
  • What if… the kernel panics?
  • What if… the cluster failover fails?
  • What if… the network switch fails?
  • What if… the network firewall fails?
  • What if… the internet link goes down?
  • What if… the internet provider drops off the grid?

Each one of the questions must be answered and the results tested. To test for power outage, pull the power. For a failed network switch, pull the network cable – and so forth.
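To keep track of which questions have been answered and tested, even a very small register helps. The sketch below is one hypothetical way to do this in Python; the `Scenario` fields and the `open_items` helper are illustrative names, not an established tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scenario:
    question: str                  # the "What if...?" being asked
    answer: Optional[str] = None   # planned response, if any
    test: Optional[str] = None     # how the response is exercised
    passed: bool = False           # did the most recent test succeed?


def open_items(scenarios: List[Scenario]) -> List[str]:
    """Return the questions that still lack an answer, a test,
    or a passing test result."""
    return [
        s.question
        for s in scenarios
        if not (s.answer and s.test and s.passed)
    ]


plan = [
    Scenario("a disk goes bad", "RAID with hot spare", "remove a disk", True),
    Scenario("the power goes out", "dual power feeds", "pull the power"),
    Scenario("the network switch fails"),
]

for question in open_items(plan):
    print("Still open: what if", question + "?")
```

In this example the disk scenario is closed, while the power scenario has a plan and a test that has not yet passed, and the switch scenario has no answer at all; both of the latter are reported as open.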

Most of the answers will include some form of redundancy – clusters, dual facilities (such as power and network and internet providers), and so on. However, redundancy is only one solution; there are prevention and alerts as well.

Each risk must be weighed against the cost to mitigate that risk. However, assuming that the risk is minimal does not eliminate the risk; the biggest problem is not accounting for a risk that eventually happens. There is nothing like downtime of a critical server to get an unforeseen risk taken care of; better to handle the risk before it happens.
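A common way to weigh a risk against its mitigation cost is to compare the expected yearly loss (probability times impact) with the yearly cost of mitigation. The Python sketch below illustrates this under stated assumptions: the probabilities and dollar figures are invented for the example, and the `Risk` class and `triage` function are hypothetical names. Note that risks that are not mitigated are still listed as accepted rather than dropped, so they remain accounted for.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Risk:
    name: str
    yearly_probability: float  # estimated chance of the event in a year
    outage_cost: float         # estimated loss if the event happens
    mitigation_cost: float     # estimated yearly cost of mitigation

    @property
    def expected_loss(self) -> float:
        """Expected yearly loss: probability times impact."""
        return self.yearly_probability * self.outage_cost


def triage(risks: List[Risk]) -> Dict[str, List[str]]:
    """Split risks into those worth mitigating and those knowingly
    accepted, each ordered by expected loss (largest first)."""
    ranked = sorted(risks, key=lambda r: r.expected_loss, reverse=True)
    result: Dict[str, List[str]] = {"mitigate": [], "accept": []}
    for risk in ranked:
        key = "mitigate" if risk.expected_loss > risk.mitigation_cost else "accept"
        result[key].append(risk.name)
    return result


# Illustrative figures only; substitute your own estimates.
risks = [
    Risk("disk failure", 0.10, 50_000, 1_000),
    Risk("internet provider outage", 0.01, 20_000, 12_000),
    Risk("power loss", 0.05, 80_000, 2_500),
]
print(triage(risks))
```

The estimates are the weak point of any such calculation: a probability guessed too low makes a real risk look acceptable, which is why the accepted list should be reviewed rather than forgotten.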

Plans also count for little if they have not been tested. If tests are not done, then the actual event will be the first time things have been put to the test – and what if something was missed and the system goes down? During a test, preventive measures can be taken to make sure that things work as they should; during an unexpected event, it is not possible to back out or prepare, and if things go down, they go down hard. Don’t let that happen to you!

And disaster planning is not limited to servers (or virtual servers) – what about the possibility of a server hosting multiple virtual servers going down? What if your server is hacked into? What about your monitoring system failing? What about getting paged? Have you planned contingencies for all of these events?
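For the question of a failing monitoring system, one simple contingency is a heartbeat: the monitor touches a file on every cycle, and an independent check raises the alarm when that file goes stale. Here is a minimal Python sketch, assuming a file-based heartbeat; the function names and the five-minute threshold are illustrative assumptions.

```python
import time
from pathlib import Path
from typing import Optional


def heartbeat_age(heartbeat_file: Path, now: Optional[float] = None) -> float:
    """Seconds elapsed since the heartbeat file was last modified."""
    current = time.time() if now is None else now
    return current - heartbeat_file.stat().st_mtime


def monitor_is_alive(
    heartbeat_file: Path,
    max_age_seconds: float = 300.0,
    now: Optional[float] = None,
) -> bool:
    """True if the heartbeat file exists and was touched recently enough.
    A missing file is treated as a dead monitor."""
    try:
        return heartbeat_age(heartbeat_file, now) <= max_age_seconds
    except FileNotFoundError:
        return False
```

The check should run somewhere other than the monitoring host itself, and should alert through a different channel than the one the monitor uses; otherwise a single failure can silence both.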

Plan, then test, test, test – and you will make it.