
Planning for a disaster is not necessarily as easy as it sounds. It helps if you have a rampant imagination. Throughout disaster planning, the dominant question is What if…? Following the planning, testing is required: the best plans are worthless if they don’t work in practice.

Consider an Internet server serving web pages. Let’s assume that downtime is not an option: this is a typical starting point. The best approach is to start with the parts most specific to the system and work outward to the complete environment:

  • What if… a disk goes bad?
  • What if… the software stops?
  • What if… memory runs out?
  • What if… the power goes out?
  • What if… the kernel panics?
  • What if… the cluster failover fails?
  • What if… the network switch fails?
  • What if… the network firewall fails?
  • What if… the internet link goes down?
  • What if… the internet provider drops off the grid?

Each one of the questions must be answered and the answers tested. To test for a power outage, pull the power. For a failed network switch, pull the network cable – and so forth.
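One way to keep track of which questions have been answered and which answers have actually been tested is a small register. The sketch below is only an illustration: the `Scenario` class, the `open_items` helper, and the sample answers ("mirrored disks", "UPS plus second power feed") are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scenario:
    """One "What if...?" question and the state of its answer (hypothetical model)."""
    question: str
    answer: Optional[str] = None  # planned response; None means not yet answered
    tested: bool = False          # True only once the response was exercised for real


def open_items(scenarios: List[Scenario]) -> List[str]:
    """Return the questions that still lack an answer or a test."""
    return [s.question for s in scenarios if s.answer is None or not s.tested]


# Illustrative entries only; the answers are assumptions for the example.
plan = [
    Scenario("a disk goes bad", answer="mirrored disks", tested=True),
    Scenario("the power goes out", answer="UPS plus second power feed"),
    Scenario("the network switch fails"),
]

for question in open_items(plan):
    print(f"Still open: what if {question}?")
```

A question leaves the open list only when it has both an answer and a completed test, which mirrors the rule that an untested answer does not count.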

Most of the answers will include some form of redundancy – clusters, dual facilities (such as power, network, and internet providers), and so on. However, redundancy is only one solution; there are prevention and alerting as well.

Each risk must be weighed against the cost to mitigate that risk. However, assuming that the risk is minimal does not eliminate the risk; the biggest problem is not accounting for a risk that eventually happens. There is nothing like downtime of a critical server to get an unforeseen risk taken care of; better to handle the risk before it happens.
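Weighing a risk against its mitigation cost can be made concrete with a rough expected-loss estimate. The sketch below assumes one common approach (expected yearly loss = expected events per year × cost per event). The function names and every figure in it are invented for illustration. Note that a risk judged cheaper to accept is still recorded rather than dropped, since a small risk is not a zero risk.

```python
def expected_annual_loss(events_per_year: float, cost_per_event: float) -> float:
    """Rough expected yearly loss from one risk (a simple assumed model)."""
    return events_per_year * cost_per_event


def decide(events_per_year: float, cost_per_event: float,
           mitigation_cost_per_year: float) -> str:
    """Compare expected loss with the yearly cost of mitigating the risk."""
    loss = expected_annual_loss(events_per_year, cost_per_event)
    if loss > mitigation_cost_per_year:
        return "mitigate"
    # The risk is accepted, not eliminated: keep it on the list and revisit it.
    return "accept, document, and revisit"


# Invented figures: (name, events per year, cost per event, mitigation cost per year)
risks = [
    ("disk failure", 0.5, 4000, 600),
    ("internet provider outage", 0.05, 50000, 6000),
]

for name, rate, cost, mitigation in risks:
    print(f"{name}: {decide(rate, cost, mitigation)}")
```

The numbers matter less than the habit: every risk gets an explicit decision, including the ones that are accepted.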

The plans also do not matter if they have not been tested. If tests are not done, then the actual event will be the first time things have been put to the test – and what if something was missed and the system goes down? During a test, preventive measures can be taken to make sure that things work as they should; during an unexpected event, it is not possible to back out or prepare. If things go down, they go down hard. Don’t let that happen to you!

And disaster planning is not limited to servers (or virtual servers) – what about the possibility of a server hosting multiple virtual servers going down? What if your server is hacked into? What about your monitoring system failing? What about getting paged? Have you planned contingencies for all of these events?
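For the question of the monitoring system itself failing, one possible contingency is a heartbeat check run from a separate machine: the monitoring system records a timestamp at regular intervals, and an independent watchdog raises the alarm when that timestamp grows too old. The sketch below shows only the staleness test. The function name and the 120-second threshold are assumptions for the example, and how the heartbeat is stored or how the alarm is delivered is left open.

```python
import time


def heartbeat_is_stale(last_beat: float, now: float,
                       max_age_seconds: float = 120.0) -> bool:
    """Return True if the last heartbeat is older than the allowed age.

    last_beat and now are Unix timestamps in seconds. The 120-second
    default is an arbitrary value chosen for this example.
    """
    return (now - last_beat) > max_age_seconds


# Simulated heartbeat written five minutes ago by the monitoring system.
now = time.time()
last_beat = now - 300

if heartbeat_is_stale(last_beat, now):
    print("Monitoring heartbeat is stale: alert through a second channel")
```

The watchdog should depend on as little of the monitored environment as possible; otherwise the same failure can silence both the monitoring system and the check on it.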

Plan, then test, test, test – and you will make it.