Monitoring APC UPSes (about cables)

APC UPSes have a port on the back for UPS monitoring (serial, USB, or RJ45). Before you can use these ports, there are a number of things to be aware of – details that are not collected in any one place.

Firstly, the USB port is not a standard USB port and should not be used as such; it requires a special cable. If you use the USB port on an APC UPS, then you will have to cycle the power on the unit to start using the serial port instead.

Secondly, the serial port also requires a special cable (the example here uses the SmartUPS 2200XL). There are two types of cables: simple signaling and smart signaling. Simple signaling cables (usually gray) work on all APC UPSes, but BackUPS units support only simple signaling. Simple signaling provides very little capability – just three signals (on battery, low battery, and UPS off), all based on pin levels on the serial cable itself. Smart signaling cables (normally black) are used with the more powerful units (such as the SmartUPS line). A smart signaling cable connects the monitoring system to the UPS and provides a character-based interface to the unit with a large number of commands.

The Linux UPS monitoring tool nut (Network UPS Tools) supports smart signaling; use a smart signaling cable whenever you can because of the added capabilities it provides.
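As a rough illustration, here is what a nut configuration for a smart signaling serial cable might look like; the section name, serial port, and description below are placeholders, and upsd still has to be configured separately before upsc will answer.

    # /etc/nut/ups.conf - minimal sketch for a smart signaling (black) serial
    # cable; the section name, port, and description are placeholders
    [smartups]
        driver = apcsmart
        port = /dev/ttyS0
        desc = "APC SmartUPS 2200XL"

    # start the driver, then (once upsd is configured and running) query the unit:
    #   upsdrvctl start
    #   upsc smartups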

Making the Case for Partitioning

What is it about partitioning? The old-school rule was to have separate partitions for /, /usr, /home, /var, and /tmp. In fact, default server installations (including Solaris) still use this partitioning setup.

Has partitioning outlived its usefulness?

This question has come up before. There are negative and positive aspects to partitioning, and the case for partitioning might not be as strong as it once was.

Partitioning means that you may end up with no space in one partition and plenty in another – in fact, this is the most common argument against partitioning. However, volume managers such as LVM and ZFS, where volumes can be grown dynamically, make this largely moot: you can expand a volume and its filesystem any time you need to.
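For example, with Linux LVM the expansion can be done online; a minimal sketch (volume group and logical volume names are placeholders):

    # add space to a logical volume and grow its ext3/ext4 filesystem online
    # (volume group and logical volume names are placeholders)
    lvextend -L +10G /dev/vg00/lv_var
    resize2fs /dev/vg00/lv_var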

However, this still means that the system will require maintenance – but that is what administrators are for, right? If a filesystem fills – or is about to fill – it is up to a system administrator to find the disk space and allocate it to the filesystem.

Another argument against partitioning is that “disk is cheap.” If that were true, why do companies still balk at putting terabytes of disk into their SANs? The phrase is trite but does not hold up in practice: 144GB disks are not bought in bulk. Companies still have to watch the budget, and buying more disk space is not necessarily a high priority until disk runs out.

So, what are the benefits to partitioning disks? There are many.

Each partition can be treated separately: the /usr filesystem can be mounted read-only, and /home can be mounted with the noexec and nosuid options, which makes for a more secure and more robust system.
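A sketch of what that looks like in practice (the remounts last only until reboot; the same options would go into /etc/fstab to make them permanent):

    # remount /usr read-only and /home with nosuid,noexec (effective until the
    # next reboot; add the same options in /etc/fstab to make them permanent)
    mount -o remount,ro /usr
    mount -o remount,nosuid,noexec /home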

Also, if there are disk errors, a single partition can be affected rather than the entire system. Thus, on a reboot, the system still comes up instead of being blocked because the root filesystem is trashed. In the same vein, if a filesystem requires a check, running a check across a 144GB filesystem can take a very long time – whereas a 10GB partition takes nowhere near as long, and the system comes back up that much faster.

Backups – and restores – are another thing that is simplified by having multiple partitions. For example, when backing up an HP-UX system using make_tape_recovery, you specify which partitions to back up to tape. These partitions are then restored when the tape is booted. If you used a single partition for everything (data, home, etc.) then you would probably not be able to make this sort of backup at all.

One of the nicest reasons to partition is the ability to separate user data from system data. This allows the system to be reinstalled while user data (and application data) remains untouched, saving time and effort. I recently installed Ubuntu Server in place of Red Hat Enterprise Linux, and since the system was a single partition, there was no way to install Ubuntu Server without wiping out 200GB of application data and restoring it – which took around nine hours each way on a gigabit network link (if nothing else was sharing the network). By contrast, when I converted my OpenSUSE laptop to Xubuntu, I was able to keep all of my user settings because /home was on a separate partition. Keeping that server on a single partition cost the company somewhere on the order of a full day's worth of time and effort – how much money would the company have saved by having a separate partition for /var/lib/mysql?

Performance is another reason for partitioning – though this is only relevant with separate disks. If you have multiple disks, you can put separate filesystems on separate disks, so a disk that is heavily used for one purpose can be dedicated to that purpose – your database accesses won't slow down because of system log writes, for example. This problem can be reduced or eliminated – or made moot – by moving data to a striped volume, and possibly by adding disk cache as well. Yet, as long as there are disk accesses, do you want them competing with your database?

Having said all of this, how does using a virtual machine change these rules? Do you need partitioning in a virtual machine?

A virtual machine makes the arguments relating to physical disks moot – performance on a virtual disk does not correlate strongly with the underlying physical hardware unless the virtual machine host is set up with segregated physical disks. Even then, the disks may actually be separate LUNs carved out of a shared RAID array.

However, the ability to secure a filesystem (such as /usr), reduce filesystem check times, prevent excessive /home usage, and so on suggests that the case for partitions is still valid.

Making the Case for IPMI

IPMI (the Intelligent Platform Management Interface) lets you manage a system remotely when you would not otherwise be able to. However, you have to convince others that it will be useful. As obvious as its value is to a professional system administrator, there are others who will not see the usefulness of IPMI. This process – making the case for a technology – comes up more often than most system administrators might realize.

There are numerous situations that require a person to be physically present at a machine – entering the firmware setup, accessing the UNIX/Linux maintenance shell, changing boot devices, and so forth. So how do you prove how beneficial implementing full IPMI support would be?

Provide a business case – and perhaps an informal user story – to show how IPMI can reduce the need for a person to actually be present.

To really make the case for IPMI, compute the actual costs of a trip to the data center – the hourly cost of the administrator, the mileage costs of the drive (both ways!), and the costs associated with handling the expense report. Then note the harder-to-quantify costs – an administrator unavailable for other tasks during a four-hour round trip, projects being delayed, and problems not getting resolved. Combine this with a user story.

For example, create a user story around a possible kernel panic. A user story requires an actual user – an individual – whose story is followed. Here is our example continued as a user story:

Alma received an email that a system (db20) was unresponsive. She found no response at all from the network, and the KVM showed a kernel panic on the display. The system accepted no keystrokes, and there was no way to power-cycle it remotely.

So Alma sent an email stating that she would be unavailable for the rest of the day, and called her babysitter to come and take care of her three children for the evening. Then she got into her car, drove the two hours to the data center, and parked in a lot ($10). She power-cycled the machine by pressing its front panel power button and checked the system's response using her laptop. She found that the server was responding and logged in.

Then Alma checked the server: the logs showed no problems restarting, the system had come up cleanly, the subsystems were all running, and the monitoring system showed everything was good.

Alma left the data center and drove the two hours back home.

If Alma is paid $60,000 a year (roughly US$28.85 per hour), the five hours she spent on this event cost US$144.23 of her time. Driving 320 miles round trip at US$0.76 per mile comes to US$243.20 in expenses – in addition to the US$10 in parking fees. This makes a total direct cost of US$397.43.

If something like this happens six times a year, the total yearly cost is US$2,384.58 – and total downtime for the server is 24 hours, for an uptime rate of 99.72%.

This accounting doesn't include the indirect costs – such as projects delayed because Alma was unable to work on them – nor the personal costs, such as babysitting and time away from family. It also leaves out the time spent processing yet another expense report, and the costs associated with the server being unavailable for three hours.

On the other hand, when Polly received word that another server in the data center was unresponsive, she also found that the kernel had panicked and there was no response from the console. She then used a command line tool to reach the baseboard management controller (BMC) through IPMI. With an IPMI command she rebooted the server, and she watched it come back up over the Serial over LAN (SoL) console. Checking the system over, Polly found that the server had booted cleanly, subsystems were operational, and all looked good.
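With a BMC configured for IPMI over LAN, the commands involved look something like this (using ipmitool; the BMC hostname and credentials are placeholders):

    # check power state, power-cycle the hung server, then attach to its serial
    # console over the LAN (BMC hostname and credentials are placeholders)
    ipmitool -I lanplus -H db20-bmc.example.com -U admin -P secret chassis power status
    ipmitool -I lanplus -H db20-bmc.example.com -U admin -P secret chassis power cycle
    ipmitool -I lanplus -H db20-bmc.example.com -U admin -P secret sol activate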

If Polly is paid the same amount as Alma and her response took 15 minutes, the total cost is US$7.21. Downtime was reduced by 92% (along with an associated reduction in costs tied to the server being down). If this happens to Polly six times a year, the total yearly cost is US$43.27 – with 1.5 hours of downtime, for an uptime rate of 99.98%.

Thus, IPMI and SoL would have saved Alma's company US$2,341.31 per year (US$2,384.58 less US$43.27) in direct costs alone.
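The arithmetic is simple enough to check; a quick sketch (assuming a 2,080-hour work year, a five-hour round trip for Alma, and a 15-minute response for Polly):

    # back-of-the-envelope version of the numbers above (assumes a 2,080-hour
    # work year, a 5-hour round trip for Alma, and a 15-minute response for Polly)
    awk 'BEGIN {
        hourly = 60000 / 2080                      # ~28.85 per hour
        alma   = 5 * hourly + 320 * 0.76 + 10      # ~397.43 per incident
        polly  = 0.25 * hourly                     # ~7.21 per incident
        # the difference works out to roughly US$2,341 per year
        printf "per year (6 incidents): alma=%.2f polly=%.2f\n", 6 * alma, 6 * polly
    }'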

The strongest case can be made if a recent event could have been solved with the technology you are proposing. If you can point to a situation that could have been resolved in ten minutes with the technology instead of hours or days without it, the usefulness of the technology will be apparent.

With this user story and business case, the case for IPMI should be readily apparent to just about anybody. The same approach can be used to make the case for other technologies.

Lessons in Communications and Reliability Learned from Egypt

You may have already heard about what has been happening in Egypt. If not, the Arab media source Al Jazeera has a dedicated page on the topic. You can watch a live Twitter stream from Twitterfall.

As part of what is happening in Egypt, Internet access in the country was disrupted and blocked, and cell phone service was halted. In particular, DNS servers were shut down or blocked, and web sites such as Facebook and Twitter were blocked completely. The majority of Egyptian ISPs shut down as well, effectively cutting off Egyptian citizens from the Internet.

Here are some of the workarounds to these problems that Egyptians found:

  • Broadcast Internet access via open wireless access points. With wireless access points set up, only one person needs a working Internet connection for dozens to share it. At least one ISP in Egypt remained up (because of government and bank usage), and an open hot spot on such a connection expands access well beyond a single user.
  • Use gateways and proxies to reach forbidden web sites. Routing traffic through other sites – sites that aren't blocked – permits access to the blocked sites. This is a form of “forced routing” that goes around censorship.
  • Use alternate DNS servers, IP addresses, or both. There are public DNS servers available to all, such as Google Public DNS and OpenDNS; if your ISP's DNS is down, you can switch to these if you know how – and use raw IP addresses if you don't. (A sketch of the resolver change follows this list.)
  • Use out-of-country dialup services. Several ISPs opened their dialup services to Egyptian citizens so they could reach the outside world.
  • Use non-Internet-based methods of communication. In Egypt, there were printed leaflets as well as amateur radio. When communication via the Internet is out, there are alternatives.
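Switching a Linux system to public resolvers is a one-file change; a minimal sketch (the addresses shown are the well-known Google Public DNS and OpenDNS resolvers):

    # point name resolution at public resolvers when the ISP's DNS servers
    # are blocked or down
    cat > /etc/resolv.conf <<'EOF'
    nameserver 8.8.8.8
    nameserver 8.8.4.4
    nameserver 208.67.222.222
    EOF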

There is a web page that details all the possibilities for getting communication out of Egypt.

If you can handle a man-made disaster – such as the cutoff of Egypt from the Internet, or the dismantling of the Wikileaks technical infrastructure – then natural disasters pale by comparison.

We’ll pray for safety and recovery in Egypt.

DNS: An Enemy of Uptime?

Recently, there have been a lot of stories in the news about DNS services being a weakness of one sort or another. Comcast customers in the American Midwest experienced downtime within the last week because their DNS servers were unavailable – something that previously happened on the American East Coast. Wikileaks.com became unavailable when its DNS provider, EveryDNS, cut it off. Many sites found their domains seized by the US government without warning and without legal warrants.

What can one do to prevent these sorts of downtime and unavailability of DNS services? One person is already considering this: the founder of The Pirate Bay is attempting to create a distributed DNS in response to the US government's seizure of numerous domains.

Others have noted related projects – projects that aimed to provide alternate root servers. One such project, Telecomix DNS, has been reinvigorated by the recent domain seizures – and even has a page for those who own seized domains. Over the history of the Internet, alternate domain root servers have sprung up – such as AlterNIC, the Open Root Server Network, and OpenNIC – but most have shut down (OpenNIC continues to operate).

However, most of those projects suffer from the same availability problem as the incumbents: if the service shuts down, its domains become unavailable, and if the operator is forced or convinced to seize a domain, that domain is gone. With a truly distributed service this becomes nearly impossible, and availability increases.

What Wikileaks has done to solve this problem (aside from moving to a Canadian DNS provider named EasyDNS) is to add multiple DNS providers beyond EasyDNS alone. PCWorld has a nice article detailing everything Wikileaks does to stay online – which provides a good lesson for the rest of us. EasyDNS also has an excellent, highly detailed article on how to keep DNS up and running in the face of a denial-of-service attack.

Have you considered what would happen if your primary DNS resolver went offline? Even if you have your own DNS server in-house, there is an upstream server that could go away. Maybe there are even two or three different servers that your systems send requests to – but are they all from the same provider? There are several services that provide free DNS that can add that diversity.
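One quick way to see how much diversity you have today is to look at who serves your zone; a sketch using dig (example.org stands in for your own domain):

    # list the domain's authoritative name servers, then their addresses -
    # if they all land in one provider's network, DNS is a single point of
    # failure (example.org stands in for your own domain)
    dig NS example.org +short
    for ns in $(dig NS example.org +short); do dig A "$ns" +short; done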

Make DNS a part of your disaster recovery plan and prevent it from taking your services down – do it today.

Update: ITWorld has a nice article that explains several projects that have sprung up to make DNS resistant to censorship by a central entity.

How Systems Fail (and Principles of Prevention)

A system failure does not always have a single, identifiable (and preventable) cause. In fact, often it does not.

The BP oil spill provides an excellent case study of this situation: the spill had numerous contributing causes, most of which seem minor or unlikely in isolation, but which taken together resulted in catastrophe.

What general principles can we apply to prevent such cascade failures? One should already be familiar: test, test, and test fail-over systems again before they are needed. Without testing, there is no guarantee that things will work as expected. In the BP oil spill, several failsafe systems did not engage as expected. In the recent IT disaster in Virginia, a storage system did not fail over when needed, resulting in downtime of several days for a significant portion of state resources.

Another principle is: sift and winnow alarms down to the truly necessary. Pay attention to alarms and other indicators, and purge alarms and notifications that are irrelevant or unnecessary. In the oil spill, one of the causes was that indications of an immediate problem were masked by other events happening at the same time. Too many alarms and too much information can hide problems and lead administrators to ignore problems that require intervention.

To improve alarms and prevent unnecessary ones, use time-based alarms – that is, alarms that account for change over time. Products such as SGI's Performance Co-Pilot (PCP), HP's Performance Agent (included with some versions of HP-UX), and the Simple Event Correlator (SEC) all help here. With alarms that work over a time span, brief peaks will not trigger alarms, but chronic problems will.

Another item that can improve alarm response is systemic alarms: monitoring that accounts for multiple systems or processes and combines them into a single meaningful indicator. Is the web site running smoothly? Are all systems reporting logs to the central server? Are all virtual environments running?

One of the earliest problems that led to the oil spill was a lack of oversight of the contractors involved in building the well: each assumed that the others would do the right thing. In system administration, we likewise assume that other systems are functioning correctly. To prevent failure, we should assume that other systems and processes could fail, and verify for ourselves that the systems we are responsible for will not fail in any case. What systems are you dependent on? Power? Cooling? A serial concentrator? Operations staff? Backup staff?

Part of the problem in the Virginia failure was that the state did not oversee the external vendor well. With proper oversight (and demands), the state IT staff could have forced the storage vendor to test the fail-over processes, and could have implemented a backup plan in case the fail-over did not take place as expected.

By studying others' failure patterns and experiences, we can minimize our own.

Virginia Experiences a Severe IT Outage

The Virginia Information Technologies Agency (VITA) has been dealing with an outage that began while storage technicians were scanning for failed components. The outage “only” affects 228 of 3,600 servers – but those servers belong to 24 different Virginia departments, including the Virginia Dept. of Motor Vehicles (DMV), the Governor's Office, the Dept. of Taxation, the Dept. of Alcoholic Beverage Control, and even VITA itself.

No word on how many Virginians are affected, but certainly it will be in the thousands.

According to the Associated Press, the crash happened when a memory card failed and the fall-back hardware did not take over. From the sound of it, an entire storage array was affected – how else to account for 228 affected servers?

This suggests that the storage array is a single point of failure, and that neither the memory card nor its fall-back was tested. There should be some way to test the hardware – or there should be a clustered storage backup.

One of the biggest problems in many IT environments – including states – is budget: having full redundancy for all subsystems is expensive. States are not known for budgets that fill all departmental needs (despite large budgets, departments scrounge most of the time…). Many other data center owners consistently have tight budgets: libraries, non-profits, trade associations, etc.

The usual response to tight budgets is to consider the likelihood of failure of a particular component and to skip redundancy for components unlikely to fail. A better approach is to weigh not the likelihood of failure but the cost of failure. The failure in Virginia's storage was supposed to be extremely unlikely, yet it has had a tremendous cost to both Virginia and Northrop Grumman.

Virginia's IT was outsourced in part to Northrop Grumman and was supposed to be a national model for privatizing state IT. However, Virginia's experience shows how privatizing can fail, and how outsourcing companies do not necessarily provide the service that is desired.

The Washington Post reported on this failure, as well as on earlier ones – such as a network failure for which no backup was in place. Both Virginia and Northrop Grumman should have noticed that and rectified it before it mattered.

The Richmond Times Dispatch has an article on this outage, and will be updating over the weekend.

Three Technologies We Wish Were in Linux (and More!)

Recently, an AIX administrator named Jon Buys talked about three tools he wishes were available in Linux. In truth, these are technologies rather than tools, and they are part of enterprise-class UNIX environments in almost every case.

One was a tool to create a bootable system recovery disk. AIX calls the tool that does this mksysb; in my world – HP-UX – it is called make_tape_recovery. In HP-UX, this utility allows you to specify which parts of the root volume group (vg00) to save, as well as other volume groups. Booting the tape created by make_tape_recovery allows you to recreate the system – whether as part of a cloning process or as part of a system recovery.

Another technology missing from Linux is the ability to rescan the system buses for new hardware. In his article, Jon describes the AIX utility cfgmgr; HP-UX uses the tool ioscan to scan for new I/O devices. Jon mentions LVM (which has its roots in HP-UX), but LVM does not preclude scanning for new devices (as any HP-UX administrator can attest).
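For reference, the HP-UX workflow looks roughly like this (run as root; output formats vary by release):

    # HP-UX: probe the I/O buses for new devices in the disk class, then create
    # any missing device special files
    ioscan -fnC disk
    insf -e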

Jon then discusses Spotlight (from Mac OS X) and laments that it is missing from Linux. Linux has Beagle and Tracker; both are quite annoying and provide nothing that locate does not – and locate is present on AIX, HP-UX, Solaris, and others. I for one would like to completely disable and remove Spotlight from my Mac OS X systems – Quicksilver and LaunchBar are both better than Spotlight. In any case, none of these tools really belongs on an enterprise-class UNIX system anyway.

As for me, there are more technologies still missing from Linux. One is LVM snapshots: they exist in Linux, but they are more cumbersome. In HP-UX (the model for Linux LVM), a snapshot is created from an empty logical volume at mount time and disappears at dismount. In Linux, the snapshot is created at logical volume creation time (whatever for??) and is destroyed by a logical volume delete. The snapshot operation should mirror that of HP-UX, which is much simpler.
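For comparison, the Linux sequence looks something like this (volume group, size, and mount point are placeholders):

    # Linux LVM: snapshot /home for a consistent backup, then remove it
    lvcreate --snapshot --size 2G --name home_snap /dev/vg00/home
    mount -o ro /dev/vg00/home_snap /mnt/home_snap
    # ... back up /mnt/home_snap here ...
    umount /mnt/home_snap
    lvremove -f /dev/vg00/home_snap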

Another thing missing from Linux, but present in every enterprise HP-UX system, is a tool like GlancePlus: a monitoring tool with graphs and alarms – including time-based alarms.

Consider an alarm that sends an email when all disks in the system average over 75% busy for five minutes running. This can be done in HP-UX; not so in a standard Linux install. There are many other examples as well.
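You can approximate such an alarm with a cron-driven script, though it is a pale imitation of a real alarm engine; a rough sketch (device pattern, threshold, and mail address are placeholders, and the -y flag needs a reasonably recent sysstat):

    #!/bin/sh
    # Rough stand-in for a time-based alarm: average %util across all disks,
    # sampled once a minute for five minutes (-y skips iostat's "since boot"
    # report). Run it from cron; the device pattern, threshold, and address
    # are placeholders.
    avg=$(iostat -dxy 60 5 | awk '$1 ~ /^(sd|hd|nvme)/ { sum += $NF; n++ }
                                  END { if (n) printf "%d", sum / n }')
    if [ "${avg:-0}" -gt 75 ]; then
        echo "Disks averaged ${avg}% busy over the last 5 minutes" \
            | mail -s "disk busy alarm" admin@example.com
    fi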

Personally, I think that Performance Co-Pilot could fill this need; however, I'm not aware of any enterprise-class Linux distribution that includes PCP as part of its standard, supported installation. PCP has its roots in IRIX from SGI – enterprise UNIX – and puts GlancePlus to shame.

Perhaps one of the biggest things missing from Linux – though not specifically related to Linux – is enterprise-class hardware: the standard “PC” platform is not suitable for a corporate data center.

While the hardware will certainly work, it remains unsuitable for serious deployments. Enterprise servers – of all kinds – offer a variety of enhanced abilities that are not present in a PC system. Consider:

  • Hot-swappable hard drives – i.e., hard drives that can be removed and replaced during system operation without affecting the system adversely.
  • Hot-swappable I/O cards during system operation.
  • Cell-based operations – or hardware-based partitioning.

For Linux deployment, the best idea may be to go with virtualized Linux servers on enterprise-class UNIX, or with Linux on Power from IBM – I don't know of any other enterprise-class Linux platform (not on Itanium, and not on SPARC) – and even Linux on Power may not support many of the enterprise capabilities listed earlier.

What are your thoughts?

Unattended Ubuntu Installations

Michael Reed wrote a nice tutorial for Linux User and Developer about unattended Ubuntu installations. Ubuntu also has excellent documentation on the booting process.

Unattended installations are a boon to system administrators – or to anyone who needs to install systems more than a few times a month.

When an installation is preconfigured – especially over the network – an administrator can start the installation and then go do something else, and productivity increases dramatically. If you have many systems to install, you can start all of them and leave them to finish on their own while you complete other tasks.

An unattended installation also leads to systems being identically configured, which makes for easier administration and easier debugging. If systems are hand-crafted, then replicating a system (for hardware upgrades or restores, say) will be harder – or, more likely, impossible.

Since Ubuntu is based on Debian, the original preseed configuration mechanism is available; in addition, the excellent kickstart (and anaconda) system from Red Hat has been adapted to Ubuntu. Preseed handles more options, but kickstart offers its own advantages as well, and the two can be used together for installations.

A whitepaper from Canonical (Automated Deployments of Ubuntu, by Nick Barcet, September 2008) describes in easy-to-read detail the possibilities of unattended installations with preseed and kickstart. While the whitepaper discusses these things in the context of Ubuntu 8.04, it is still a valuable resource in concert with the official Ubuntu installation documentation.

The files required for booting from the network are available from Ubuntu mirrors; be sure to get the right ones for your architecture and Ubuntu version (this discussion assumes i386 and Ubuntu 10.04, Lucid Lynx).

An abbreviated description of the process to set up for kickstart is this:

  • Set up an HTTP server (web server).
  • Set up a DHCP server.
  • Set up a TFTP server.
  • Set up an NFS server.

Once these four servers are up, you are more than halfway there. The kickstart file will be placed on the web server for the unattended installations.
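As a rough illustration, a minimal kickstart file might look like the following; Ubuntu's kickstart support covers only a subset of Red Hat's directives, so validate anything you use against the whitepaper and the installer documentation. All values here are placeholders.

    # ks.cfg - minimal kickstart sketch, served from the web server referenced
    # by the ks= boot parameter; all values are placeholders
    install
    lang en_US.UTF-8
    keyboard us
    timezone --utc America/New_York
    rootpw changeme
    clearpart --all --initlabel
    part /boot --fstype ext3 --size 256
    part swap --size 2048
    part / --fstype ext3 --size 1 --grow
    %packages
    openssh-server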

Configure the DHCP server to point clients at the pxelinux.0 boot loader on the appropriate TFTP server. The boot files have to be placed in the appropriate locations on the TFTP server.
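The relevant pieces look something like this (addresses and paths are placeholders for your network, and the installer paths should match the layout of the netboot tarball you downloaded):

    # dhcpd.conf excerpt - point PXE clients at the TFTP server
    subnet 192.168.1.0 netmask 255.255.255.0 {
        range 192.168.1.100 192.168.1.200;
        next-server 192.168.1.10;            # TFTP server
        filename "pxelinux.0";               # boot loader to fetch
    }

    # pxelinux.cfg/default on the TFTP server - boot the installer and point it
    # at the kickstart file on the web server (a preseed file can be passed
    # with preseed/url= in the same way)
    default install
    label install
        kernel ubuntu-installer/i386/linux
        append initrd=ubuntu-installer/i386/initrd.gz ks=http://192.168.1.10/ks.cfg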

Make sure, too, that all of the files necessary for installation – the entire distribution – are available via NFS or via the Internet. Once the boot loader has started and the Linux kernel is loaded, all of the packages will be downloaded from this source.

The best thing to do is to test the automated installs on a virtual machine if possible (or a scratch machine if not) – and keep testing until the installs run cleanly and unattended. Do this before your boss says you need that install done pronto!

Check the documentation for details; I’m currently setting up Lucid Lynx and will provide complete details later.

Is Your Data Center Power and Cooling Truly Safe?

Recently, Amazon has made the news for several unrelated outages at different data centers. The causes of these outages are very interesting, and provide a lesson for the rest of us.

The most recent outage affected Amazon EC2 users on 13 May. It was caused when the cutover after a loss of utility power failed: a switch did not activate as it should have. To make matters worse, the switch failed because of a misconfiguration introduced by the manufacturer.

This outage has been compared to a similar one that affected Rackspace in 2007. During a utility power outage (caused by a car crash), the power went on and off a couple of times, preventing the cooling apparatus from cooling the data center properly. With heat rising in the data center, Rackspace had to shut down equipment or suffer equipment failures.

Another power loss affected an Amazon data center twice on 4 May. The day involved a planned switchover from one utility power substation to another. During this process, one UPS failed to cut over to the backup generators, resulting in an outage for a number of servers. Later that day, after the failed UPS had been bypassed, human error caused one of the backup generators to shut down, taking down servers once again.

One of the biggest obstacles to solving these problems is money. Until something like this happens to them, many companies do not want to put up the money it would take to avoid such outages.

What would it take to avoid outages like these? Assume that power outages will occur and cooling will fail; how do we handle that? How do we increase reliability?

  • Don’t rely on a single UPS to save the data center. Use multiple UPSes, or put each rack on its own rack-mounted UPS.
  • Cluster multiple servers together across data centers in an active/active configuration so that downtime in one data center will be mitigated by another.
  • Pair administrators or team members together to reduce human error. Studies show that the worst evaluators of our own capabilities are ourselves (“We have met the enemy and he is us.”).
  • Make sure that cooling devices can run on backup power. Also verify that cooling devices will run long enough to shut down servers if necessary.
  • Investigate alternative cooling sources, such as external air, or “rack-local” cooling that could run off of UPS power.
  • Don’t trust the manufacturer. Nine times out of ten things might be just fine; that leaves a 10% failure rate. Shoot for 100%!
  • Remove all single points of failure. What if a UPS fails? What if an air conditioning unit fails? What if the power fails?
  • Test before things happen! Do you have complete protection from power failures? Test it. Do you have complete protection against cooling loss? Test it.

Even smaller companies with a single on-site data center could take advantage of some of these options without breaking the bank. Servers could be clustered with an external hosting site, or a company could partner with a related company so that each hosts servers for the other.

To save labor costs, the work could be spread over a longer schedule rather than adding more staff to the project.

One other thing that could be done for reliability's sake – though it is probably cost-prohibitive for all but the most deep-pocketed corporations – is to add a second power feed from a different substation over a different physical path. If power from one substation goes out, the other substation continues to provide power. (That may not even be possible everywhere…)

Note that monitoring does not take the place of testing. There is a difference between knowing that the data center is without power and preventing the data center from losing power.

System checks do not take the place of testing. The UPS might claim to be working fine – do you trust it? Perhaps it will work just fine – and maybe not. Perhaps it will work, but it is misconfigured.

Test! One of my favorite stories about testing involves a technical school that provided mainframe computing services for students and faculty. They tested a complete power cycle of the mainframe each month. One month, they went through their testing, rectifying whatever needed rectifying as they went. Two days later, a thunderstorm knocked out power to the entire school, requiring a full restart of the mainframe – which went off without incident.

Would that have been as smooth without testing? I doubt it.