Naming Those Servers

When it comes time to name a server, there are a few places to go for ideas – and a few things to remember.

One of my favorites is Google Sets. Give it several entries, and it will come back with more items that match your list. This can be handy when you are trying to come up with a new name that fits an existing set.

An unsung champion in this area is the site NamingSchemes.com. Once you’ve found this site, you’ll want to bookmark it – there’s nothing else like it out there. If a scheme is missing, you can add it, since it is an open wiki.

Lastly, there is a question over at Server Fault which has some fabulous naming schemes that people have used (my all-time favorite has to be the one that used names from the RFB list!). Another fantastic answer to that question uses the player names from the famous Abbott and Costello routine “Who’s on First?”

There is also an IETF RFC on naming your computer (RFC 1178, “Choosing a Name for Your Computer”) which has good tips.

In my background, I’ve used (or seen used – or heard of) schemes based on the Horsemen of the Apocalypse (“war”, “famine”, etc.), colors, headache relief (“aspirin”, “acetaminophen”, etc.), “test” and its synonyms (“test”, “quiz”, “exam” – for testing servers!), cities, and fruits. The cities scheme was fun – the hardware had two partitions, so each pair of partitions was named after two cities located near each other. The headache-relief scheme was fun until the pharmaceutical company noticed that some of the names were competing drugs…

Think about your names – and have some fun, too!

Making the Case for IPMI

IPMI (the Intelligent Platform Management Interface) allows you to manage a system remotely when it would not otherwise be reachable. However, you have to convince others that it will be useful. As obvious as its value is to a professional system administrator, others will not see the usefulness of IPMI. This process – making the case for a technology – comes up more often than most system administrators might realize.

There are numerous situations that require a person to be physically present at a machine – entering the firmware setup, accessing the UNIX/Linux maintenance shell, changing boot devices, and so forth. So how do you prove how beneficial implementing full IPMI support would be?

Provide a business case – and perhaps an informal user story – to show how IPMI can reduce the need for a person to actually be present.

To really make the case for IPMI, compute the actual costs of a trip to the data center – the hourly cost of the administrator, the cost of gasoline (both there and back!), and the cost of handling the expense report. Then note the harder-to-quantify costs: an administrator unavailable for other tasks during the four-hour round trip, projects being delayed, and problems not getting resolved. Combine this with a user story.

For example, create a user story around a possible kernel panic. A user story requires an actual user – an individual – whose story is followed. Here is our example continued as a user story:

Alma received an email that a system (db20) was unresponsive. Checking the system, she found no response from the network at all, and the KVM showed a kernel panic on the display. The system accepted no keystrokes, and there was no way to power-cycle it remotely.

So Alma sent an email stating that she would be unavailable for the rest of the day, and called her babysitter to come and take care of her three children for the evening. Then she got into her car, drove the two hours to the data center, and parked in a lot ($10 for the hour). She power-cycled the machine by pressing its front-panel power button and checked the system’s response using her laptop. She found that the server was responding and logged in.

Then Alma checked the server: the logs showed no problems restarting, the system had come up cleanly, the subsystems were all running, and the monitoring system showed everything green.

Alma left the data center and drove the two hours back home.

If Alma is paid $60,000 yearly (about US$28.85 per hour), the cost of the roughly five hours she spent on this event is US$144.23. If she drove 320 miles round trip at US$0.76 per mile, that adds US$243.20 in mileage – on top of the US$10 in parking fees. This makes a total direct cost of US$397.43.

If something like this happens six times a year, then the total yearly cost is US$2,384.58 – and the total downtime for the server is 24 hours, for an uptime of about 99.7%.

This account doesn’t include the indirect costs – such as projects being delayed because Alma was unable to work on them – or the personal costs, such as babysitting and time away from family. Nor does it include the time that HR staff spent on yet another expense report, or the costs associated with the server being unavailable for roughly four hours.

On the other hand, when Polly received word that another server in the data center was unresponsive, she likewise found that the kernel had panicked and there was no response from the console. She then used a command-line tool to access the baseboard management controller (BMC) through IPMI. With an IPMI command she rebooted the server, and she watched the boot over the Serial over LAN (SoL) console. Checking the system over, Polly found that the server had booted cleanly, the subsystems were operational, and all looked good.
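
To make this concrete, here is a minimal sketch of the kind of command-line access Polly used – an illustration built around ipmitool (which can talk to a BMC over the network), not her actual tooling. The BMC hostname, user name, and password file below are made-up placeholders.

    #!/usr/bin/env python3
    """Illustrative only: power-cycle a hung server through its BMC with ipmitool.

    The BMC hostname, user name, and password file are hypothetical placeholders.
    """
    import subprocess

    BMC_HOST = "db20-bmc.example.com"    # hypothetical address of db20's BMC
    BMC_USER = "admin"                   # placeholder credentials
    BMC_PASS_FILE = "/etc/ipmi/db20.pw"  # keeps the password out of the process list

    def ipmi(*args):
        """Run one ipmitool command against the BMC over the lanplus interface."""
        cmd = ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
               "-U", BMC_USER, "-f", BMC_PASS_FILE, *args]
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    if __name__ == "__main__":
        print(ipmi("chassis", "power", "status"))  # e.g. "Chassis Power is on"
        ipmi("chassis", "power", "cycle")          # hard power-cycle the hung host
        # The console can then be watched interactively with Serial over LAN:
        #   ipmitool -I lanplus -H <bmc> -U <user> -f <password file> sol activate

Watching the SoL console is better done straight from a terminal, since it is a live session rather than a one-shot command.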

If Polly is paid the same amount as Alma, and her response took 15 minutes, we get a total cost of US$7.21. Downtime was reduced by more than 90% (along with an associated reduction in the costs tied to the server being down). If this happens to Polly six times a year, the total yearly cost is US$43.27 – with a total downtime of 1.5 hours, for an uptime of 99.98%.

Thus, IPMI and SoL would have saved Alma’s company roughly US$2,341 in direct costs per year (US$2,384.58 less US$43.27).
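
Those figures are easy to sanity-check with a few lines of code. This is just the arithmetic from the story restated – the salary, mileage rate, trip length, and event counts are the same assumptions used above, not data from any real incident.

    #!/usr/bin/env python3
    """Recompute the Alma-vs-Polly cost comparison from the figures above."""

    HOURLY_RATE = 60_000 / 2080       # $60,000/year over 2,080 working hours: about $28.85/hr
    EVENTS_PER_YEAR = 6
    HOURS_PER_YEAR = 24 * 365

    # Alma: a five-hour trip, 320 miles at $0.76/mile, $10 parking, ~4 hours of downtime
    alma_event = 5 * HOURLY_RATE + 320 * 0.76 + 10
    alma_yearly = alma_event * EVENTS_PER_YEAR
    alma_downtime = 4 * EVENTS_PER_YEAR               # hours of downtime per year

    # Polly: 15 minutes at the keyboard, no travel, ~15 minutes of downtime
    polly_event = 0.25 * HOURLY_RATE
    polly_yearly = polly_event * EVENTS_PER_YEAR
    polly_downtime = 0.25 * EVENTS_PER_YEAR           # hours of downtime per year

    def uptime(downtime_hours):
        """Percentage of the year the server was up."""
        return 100 * (1 - downtime_hours / HOURS_PER_YEAR)

    print(f"Alma:  ${alma_event:.2f}/event, ${alma_yearly:.2f}/year, uptime {uptime(alma_downtime):.2f}%")
    print(f"Polly: ${polly_event:.2f}/event, ${polly_yearly:.2f}/year, uptime {uptime(polly_downtime):.2f}%")
    print(f"Direct savings: ${alma_yearly - polly_yearly:.2f}/year")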

The strongest case can be made if a recent event could have been solved with the technology you are proposing. If you can point to a situation that could have been resolved in ten minutes with the technology, instead of the hours or days it actually took without it, then its usefulness will be apparent.

With this user story and business case, the value of IPMI should be readily apparent to just about anybody. The case for other technologies can be made in the same way.

Bringing the Network up to (Gigabit) Speed

When looking at increasing network speed in the enterprise, there are a lot of things to consider – and missing any one of them can result in a slowdown in part or all of the network.

It is easy enough to migrate slowly by replacing pieces with others that support all of the relevant standards (such as 10/100/1000 switches). However, such a migration can bog down, leaving old equipment in place and slowing everyone down.

First, determine whether the infrastructure can handle the new equipment. Is the copper of a sufficient grade to handle the increased demands? If not, the cables will have to be replaced – perhaps with Cat 6 or better, or even fiber if your needs warrant it. Check for undue interference as well – fiber is immune to the electromagnetic interference that copper picks up.

After the cabling is ready, check the rest of the infrastructure – all of it; it is easy to miss one device. Also check the capabilities of each piece. For example, can the switch handle full gigabit speeds on all ports at once? You might be surprised: a 48-port gigabit switch needs roughly 96 Gbps of non-blocking switching capacity to run every port at full-duplex line rate, and many lower-end switches fall well short of that.

Once the equipment is in place, make sure that all clients are actually connecting at gigabit speeds. Most switches have indicators that show whether a port has negotiated a gigabit link or not.

Make doubly sure that servers are running at full speed, as a slowdown there will affect everyone who uses that server. This is doubly important for firewalls, since every connection that passes through them feels the impact.
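
One quick way to verify this on Linux servers – a minimal sketch, assuming the standard /sys/class/net interface is present – is to read each interface’s negotiated speed from sysfs (the same figure ethtool reports):

    #!/usr/bin/env python3
    """Report the negotiated link speed of each network interface (Linux only).

    Reads /sys/class/net/<iface>/speed, which the kernel exposes in Mb/s.
    """
    import glob
    import os

    for path in sorted(glob.glob("/sys/class/net/*")):
        iface = os.path.basename(path)
        if iface == "lo":                        # skip the loopback device
            continue
        try:
            with open(os.path.join(path, "speed")) as f:
                speed = int(f.read().strip())    # Mb/s; raises an error if link is down
        except (OSError, ValueError):
            print(f"{iface}: link down or speed unknown")
            continue
        if speed < 0:                            # some drivers report -1 with no link
            print(f"{iface}: link down")
            continue
        flag = "OK" if speed >= 1000 else "** below gigabit **"
        print(f"{iface}: {speed} Mb/s {flag}")

On the switch side, the per-port link lights or the management interface will show the same information.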

Lastly, don’t forget the telco equipment. If the link to the Internet router is still running at an older, slower speed, Internet access for the entire enterprise will be slowed down.

One more thing – an upgrade such as this would be a perfect time to bring more advanced equipment in house. Just be conscious of the corporate budget. In such cases, it also helps to present improvements that the executives can see and experience personally, rather than elusive benefits that only the IT staff will notice.

Good luck in your speed improvement project!

Virginia Experiences a Severe IT Outage

The Virginia Information Technologies Agency (VITA) has been dealing with an outage that occurred while storage technicians were scanning for failed components. The outage “only” affects 228 of 3,600 servers – but those servers serve 24 different Virginia agencies, including the Virginia Dept. of Motor Vehicles (DMV), the Governor’s Office, the Dept. of Taxation, the Dept. of Alcoholic Beverage Control, and even VITA itself.

No word on how many Virginians are affected, but certainly it will be in the thousands.

According to the Associated Press, the crash happened when a memory card failed and the fail-over hardware did not take over successfully. From the sound of it, an entire storage array was affected – how else to account for 228 servers going down?

This suggests that the storage array is a single point of failure, and that neither the memory card nor its fail-over path had been tested. There should be some way of testing that hardware – or a clustered storage backup in place.

One of the biggest problems in many IT environments – including state governments – is budget: full redundancy for every subsystem is expensive. States are not known for budgets that cover every departmental need (despite large overall budgets, departments scrounge most of the time…). Many other data center operators have consistently tight budgets as well: libraries, non-profits, trade associations, and so on.

The usual response to tight budgets is to consider the likelihood of failure of a particular component and to skip redundancy where failure seems unlikely. A better approach is to weigh not the likelihood of failure but the cost of failure. The failure in Virginia’s storage was supposed to be extremely unlikely, yet it has had a tremendous cost to both Virginia and Northrop Grumman.

Virginia’s IT was outsourced in part to Northrop Grumman, and the arrangement was supposed to be a national model for privatizing state IT. However, Virginia’s experience shows how privatization can fail, and how outsourcing companies do not necessarily provide the service that is expected.

The Washington Post reported on this failure, as well as on others in the past – such as when the network failed and there was no backup in place. Both Virginia and Northrop Grumman should have noticed these gaps and rectified them before they mattered.

The Richmond Times-Dispatch has an article on this outage, and will be updating it over the weekend.

Software to Keep Servers Running During Cooling Failures

Purdue has created software for Linux that will slow down processors during a cooling failure in a data center.

While a processor runs, it generates heat; the slower it runs, the less heat it generates. Thus, when the air-cooling system in a data center fails, the less heat the better. When thousands of servers are clocked down at once, the heat savings are tremendous.

With the software from Purdue, a server slows way down in order to generate the least amount of heat possible. With this change, servers can be kept running longer and could potentially avoid downtime entirely.

At Purdue’s supercomputing center where this was developed, they’ve already survived several cooling failures without downtime.

Purdue’s situation, however, does have some unique qualities. One is that the software was designed for their clusters, which comprise thousands of CPUs – meaning that a slow-down can be activated across several thousand servers simultaneously. This has a tremendous effect on the heat load in the data center, and it is made easier by the fact that all the servers are identical.

With that many servers, the cluster can dominate the server room as well. In a heterogeneous environment like most corporate server rooms, software like this would have to run on all platforms to be effective.

The places where slowdown software could be most effective are large clustered environments, as well as small or homogeneous environments. Slowdowns could be triggered by many things: cooling failures, human intervention, or even the temperature of the server itself.
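
Purdue’s tool itself isn’t available to dissect here, but on Linux the underlying mechanism for any of these triggers is CPU frequency scaling. The sketch below is an illustration of the idea only – not Purdue’s software – and clamps every CPU to its lowest advertised frequency through the cpufreq sysfs interface (it must run as root, and it assumes a cpufreq driver is loaded):

    #!/usr/bin/env python3
    """Emergency slow-down sketch: clamp all CPUs to their minimum frequency.

    Illustrates the mechanism only; this is not Purdue's software. Requires root
    and a kernel exposing the cpufreq sysfs interface.
    """
    import glob

    for cpudir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq"):
        with open(f"{cpudir}/cpuinfo_min_freq") as f:
            min_khz = f.read().strip()        # lowest frequency the CPU supports, in kHz
        with open(f"{cpudir}/scaling_max_freq", "w") as f:
            f.write(min_khz)                  # cap the scaling governor at that frequency
        print(f"{cpudir}: capped at {int(min_khz) // 1000} MHz")

    # To restore full speed later, write cpuinfo_max_freq back into scaling_max_freq.

A production version would hook this to an alert from the cooling system’s monitoring and lift the cap once temperatures recover.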

Is Your Data Center Power and Cooling Truly Safe?

Recently, Amazon has made the news for several unrelated outages at different data centers. The causes of these outages are very interesting, and provide a lesson for the rest of us.

The most recent outage affected Amazon EC2 users on 13 May. It occurred when a cutover after the loss of utility power failed: a power transfer switch did not activate as it should have. To make matters worse, the switch failed because of a misconfiguration introduced by the manufacturer.

This outage has been compared to a similar one that affected Rackspace in 2007. During a utility power outage (caused by a car crash), the power went on and off several times, preventing the cooling equipment from cooling the data center properly. With the heat rising in the data center, Rackspace had to shut down equipment or suffer equipment failure.

Another power loss affected an Amazon data center on 4 May – twice. The day’s plan involved a switchover from one of the electric utility’s substations to another. During this process, one UPS failed to cut over to the backup generators, resulting in an outage for a number of servers. Later that day, after the failed UPS had been bypassed, human error caused one of the backup generators to shut down, taking servers down once again.

One of the biggest obstacles to resolving these problems is money. Until something like this happens to a company, many are unwilling to spend what it would take to avoid such outages.

What would it take to avoid outages like these? Assume that power outages will occur and cooling will fail; how do we handle that? How do we increase reliability?

  • Don’t rely on a single UPS to save the data center. Use multiple UPSes, or put each rack on its own rack-mounted UPS.
  • Cluster multiple servers together across data centers in an active/active configuration so that downtime in one data center will be mitigated by another.
  • Pair administrators or team members together to reduce human error. Studies show that the worst evaluator of our own capabilities is ourselves (“We have met the enemy and he is us.”)
  • Make sure that cooling devices can run on backup power. Also verify that cooling devices will run long enough to shut down servers if necessary.
  • Investigate alternative cooling sources, such as external air, or “rack-local” cooling that could run off of UPS power.
  • Don’t trust the manufacturer. Nine times out of ten things might be just fine; that leaves a 10% failure rate. Shoot for 100%!
  • Remove all single points of failure. What if a UPS fails? What if an air-conditioning unit fails? What if the power fails?
  • Test before things happen! Do you have complete protection from power failures? Test it. Do you have complete protection against cooling loss? Test it.

Even smaller companies with a single on-site data center could take advantage of some of these options without breaking the bank. Servers could be clustered with an external hosting site, or a company could partner with a related company so that each hosts servers for the other off-site.

To save labor costs, one could extend the time allotted to complete the work rather than adding more staff to the project.

One other thing that could be done for reliability’s sake – though it is probably cost-prohibitive for all but the most voracious and deep-pocketed corporations – is to add a second power feed from a different substation over a different path. If power goes out from one substation, the other substation continues to provide power. (I’m not sure that is even possible everywhere…)

Note that monitoring does not take the place of testing. There is a difference between knowing that the data center is without power and preventing the data center from losing power.

System checks do not take the place of testing. The UPS might claim to be working fine – do you trust it? Perhaps it will work just fine – and maybe not. Perhaps it will work, but it is misconfigured.

Test! One of my favorite stories about testing involves a technical school that provided mainframe computing services for students and faculty. They tested a complete power cycle of the mainframe each month. One month, they went through their testing, rectifying whatever needed rectifying as they went. Two days later, a power loss to the entire school during a thunderstorm required an actual restart of the mainframe – which went off without incident.

Would that have been as smooth without testing? I doubt it.

Google Apps Downtime Report: Perfect Example?

On 24 February 2010, Google App Engine suffered an outage when an entire data center lost power. App Engine was down for two hours as staff worked feverishly to fix problems after power came back up.

Google released a detailed downtime report which has been called a near-perfect example of what such a report should be. Data Center Knowledge summarized the event well in an article; they have also spoken with Google previously about how it handles outages.

Google also kept people apprised of what was happening during the outage itself.

Google’s handling of a data center outage stands in stark contrast to the handling of a 6 March 2010 outage at Datacom in Melbourne, Australia. The story is just incredible: the data center’s managing director said there was absolutely no outage; customers, the company’s own network operations center (NOC), and the press all disagreed – and backed it up with pictures.

Some people seemed upset that pictures were taken inside the data center and published on the Internet and in the press (now cellphones have to be left at the door) – yet this is what a whistleblower does. If the event had been handled differently, no doubt Datacom would have been better off.

Current Ethernet Not Enough?

At the recent Ethernet Technology Summit, there was grousing about the need for more power-efficient switches and more manageable switches – but most of all for faster Ethernet.

Facebook, for one, spoke of having 40 Gbit/s coming out of each rack in the data center, and related how its 10 Gb Ethernet fabric is not enough and won’t scale. There are new standards in the works (100 Gb Ethernet and Terabit Ethernet), but they are not yet finalized. Analysts suggest there is pent-up demand for 100 Gb Ethernet, and the conference bore that out.

Supposedly, there is supp

Largest Data Center Consolidation Ever…

The United States federal government recently announced that there is to be a reduction in the number of its data center facilities. United States CIO Vivek Kundra sent a memo to agencies announcing the move and the preparations required for it.

In 1999, there were 432 facilities; eleven years later, the number has more than doubled to over 1,100.

Reasons given for the massive reduction include costs and energy efficiency. With the changes in the federal government that happen every four years (otherwise known as “electing a new president”), it should not be a surprise that the consolidation is to happen by 2012.

With the economy as it is currently, any massive change like this will affect many sectors. Data center providers with inefficient facilities will find themselves losing a major customer as federal agencies leave for other providers.

This shift to more efficient data centers, on the other hand, could spur the building of new ones: that will affect the building trades in a positive way, and quite likely give the related data center providers a more positive outlook.

Intel is also going through a data center consolidation; it has an entire web site dedicated to the process, which contains valuable information.

HP’s Wind-Cooled Data Center in Wynyard Opens

This is extremely interesting news, and has been covered widely in the business and technology press. HP designed and built a data center in Wynyard Park, England (near Billingham) which uses wind for nearly all its cooling needs.

EDS (since purchased by HP) announced the building of the data center early in 2009, and the technology involved was already making news. Data Center Knowledge had an article on it; ComputerWorld’s Patrick Thibodeau also wrote a very nice in-depth article on the planned data center. ComputerWorld followed up with an equally comprehensive article when the data center opened recently.

Another extensive and illuminating article was written by Andrew Nusca at SmartPlanet.

What is so interesting about the Wynyard data center?

  • It is wind-cooled, and uses a 12-foot plenum (with the equipment located on the floor above).
  • All racks are white, instead of black: this requires 40% less lighting in the data center.
  • Rainwater will be captured and filtered, then used to maintain the appropriate humidity.
  • The facility is calculated to have a PUE of 1.2 – one of the lowest ever; new energy-efficient data centers typically have a PUE of around 1.5 (see the short example after this list).
  • HP estimates they could save as much as $4.16 million in power annually.
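
PUE is simply total facility power divided by the power delivered to the IT equipment, so the gap between 1.5 and 1.2 is easy to put in concrete terms. The 1 MW IT load below is an arbitrary illustrative figure, not a number from HP:

    # PUE = total facility power / IT equipment power
    it_load_kw = 1000                  # hypothetical 1 MW of IT equipment
    for pue in (1.5, 1.2):
        total_kw = it_load_kw * pue
        overhead_kw = total_kw - it_load_kw
        print(f"PUE {pue}: {total_kw:.0f} kW total, {overhead_kw:.0f} kW of cooling/power overhead")

    # At PUE 1.2, the same IT load needs 300 kW less overhead power, around the clock.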

These are indeed exciting times for data center technology.