Software to Keep Servers Running During Cooling Failures

Purdue has created software for Linux that will slow down processors during a cooling failure in a data center.

While a processor runs, it generates heat. The slower it runs, the less heat it generates (dynamic power falls roughly in proportion to clock frequency). So when the air cooling system in a data center fails, the less heat the better. When thousands of servers are clocked down at once, the heat savings are tremendous.

With the software from Purdue, a server slows way down in order to generate the least amount of heat possible. With this change, servers can be kept running longer and may avoid downtime entirely.

At Purdue’s supercomputing center, where this software was developed, they have already survived several cooling failures without downtime.

Purdue’s situation, however, does appear to have some unique qualities. One is that the software was designed for their clusters, which number in the thousands of CPUs, meaning that a slow-down can be activated across several thousand servers simultaneously. This has a tremendous effect on the cooling load in the data center, and it is also easy to do since all the servers are identical.

With that many servers, the cluster can dominate the server room as well. In a heterogeneous environment like most corporate server rooms, software like this would have to run on every platform to be effective.

Slowdown software would be most effective in large clustered environments, as well as in small or homogeneous ones. Slowdowns could be triggered by many things: cooling failures, human intervention, or even the server itself heating up.
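
For illustration, here is a minimal sketch of what such a slow-down could look like on a single Linux node. This is not Purdue’s software, just the general technique: it assumes the standard cpufreq sysfs interface is present and that a “powersave” governor is available.

# Sketch only: pin every CPU at its lowest frequency via the Linux cpufreq
# sysfs interface. Assumes the "powersave" governor exists and root access.
import glob

GOVERNOR_FILES = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"

def set_governor(governor):
    for path in glob.glob(GOVERNOR_FILES):
        with open(path, "w") as f:
            f.write(governor)

def on_cooling_failure():
    # Generate as little heat as possible until cooling is restored.
    set_governor("powersave")

def on_cooling_restored():
    # "ondemand" is a common default governor; use whatever is normal locally.
    set_governor("ondemand")

if __name__ == "__main__":
    on_cooling_failure()

At cluster scale the same change would be pushed to every node at once, which is where the big heat savings come from.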

The Wikipedia Outage and Failover

The recent Wikipedia outage shows the problems with a typical failover system. The European data center housing Wikipedia’s servers experienced a cooling failure, which caused the servers to shut down. This is a typical failure (though one that should be prevented).

The event was logged in the admin logs starting at 17:18 on March 24. All of Wikimedia’s server administration is at wikitech.wikimedia.org.

What happened next extended the outage longer than it should have been: the failover from the European data center to Wikipedia’s Florida data center failed to complete properly.

Certainly, to prevent this failure, the failover (and fail-back) could have been tested further, the process refined, and the tests done routinely.

However, there is another possibility: use an active standby instead. That is, instead of having a failover process kick in when failure occurs, use an active environment where there are redundant servers serving clients.

If you have a failover process, it is a sort of “opt-in” – the standby servers must choose to take over from the failed servers. Thus there is a process (the failover) that must be tested, and tested often, to make sure it will work when it is actually needed. Testing also often means that an actual service outage must be experienced. This is the active-passive high-availability cluster model: the passive server must be brought online to take over from the failed nodes.

Using an active but redundant environment means that if any server, or any data center, dies, then service is degraded slightly and nothing more. This is the active-active high-availability cluster model. There is no need for monthly failover testing, and perhaps no testing at all: during upgrades, the servers can be taken out of service one at a time and the results monitored.
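
As a rough sketch of the difference, an active-active client simply skips a dead backend and carries on; there is no separate failover procedure to invoke. The host names below are hypothetical.

# Sketch of the active-active idea: requests rotate across redundant
# backends, and a dead backend is simply skipped. Host names are hypothetical.
import itertools
import socket

BACKENDS = ["dc-eu.example.org", "dc-us.example.org"]   # two active data centers
_rotation = itertools.cycle(BACKENDS)

def fetch(path, timeout=2.0):
    last_error = None
    for _ in range(len(BACKENDS)):
        host = next(_rotation)
        try:
            with socket.create_connection((host, 80), timeout=timeout) as conn:
                request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
                conn.sendall(request.encode())
                return conn.recv(65536)   # one backend down means degraded, not dead
        except OSError as error:
            last_error = error            # skip the dead backend; no failover step
    raise RuntimeError("all backends down: %s" % last_error)

Losing one data center in this model just means the other carries the load until it returns.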

The usual argument against such redundancy is cost: the redundant servers need enough spare capacity to absorb the failed load, and that capacity is unavailable for other uses in normal operation. Yet how much downtime can you experience before you start losing money or public goodwill?

If Wikipedia had put their servers together so that a failover was not necessary, it might have saved them from going down for several hours.

Sony to Kill PS3 Linux Installations on Thursday

Recently, Sony announced that update 3.21 for the Playstation 3 (to be released on Thursday, 1 April) would remove the “Other OS” option – which means that not only will it become impossible to install Linux on the Playstation 3, but any existing installation will become inaccessible. According to Sony, this is to make the gaming console more reliable.

When the Playstation 3 was introduced, the company Terra Soft Solutions released Yellow Dog Linux for the PS3 and sold PS3 consoles with Yellow Dog pre-installed – including PS3 clusters. Groups at the University of Massachusetts Dartmouth (with the Playstation 3 Gravity Grid), the University of California Berkeley, and North Carolina State University have all been using PS3 clusters to do computing. Sony Entertainment Spain assisted the Computational Biochemistry and Biophysics Lab in Barcelona, Spain, to create the PS3Grid (now rebranded GPUGrid).

As recently as January 2010, the United States Air Force Research Laboratory in Rome, NY, announced that it was adding 1,700 PS3s to the 300+ it already has clustered (the TeraFLOPS Heterogenous Cluster).

The PS3 was supposed to be an open platform, even supported by Sony. I wonder what happened. I can’t imagine that the USAF will be happy about this, and I can only hope that cluster administrators see this one coming and can stop it – or there will be some dead clusters.

I’ve been waiting for the prices on old PS3s to come down and my budget to go up just to run Linux on it – now the next update is to kill it. Not nice.

I suspect there will be some lawsuits if this update truly comes to pass.

UPDATE: The Electronic Frontier Foundation has a nice, expansive writeup on this. One thing they note is that a hacker recently discovered a way to crack the security on the PS3 hypervisor (using the OtherOS feature and some soldering), permitting full, unrestricted access to the entire PS3 hardware environment. They also note that Sony pulled something like this with the Aibo robot dog some years back.

Data Centers: Weta Digital, New Zealand

Weta Digital, the special effects company behind Lord of the Rings, King Kong (2005), X-Men, and Avatar, is in the news again.

Data Center Knowledge has an article about their data center, as well as another one from last year.

Information Management also had an article about it, as well as a blog post by Jim Ericson.

HP even has a video about Weta’s use of HP blades in its cluster.

Some of the more interesting things about their data center include:

  • Water cooling throughout.
  • External heat exchangers to release heat.
  • Blades in a clustered configuration.

This is just the beginning. While this data center is not as radical as others discussed here recently, it is more in the realm of current possibilities. There are photographs in the current Data Center Knowledge article as well.

A Data Center in a Silo

The CLUMEQ project is designing a supercomputer and has several sites already built. One of these, a site in Quebec, was built in an old silo that used to contain a Van de Graaff generator.

An article in the McGill Reporter from several years ago described the supercomputer installation at Montreal.

The new CLUMEQ Colossus (as the Quebec installation is called) was described in an article in Data Center Knowledge. The design puts all of the computers (Sun blades) in a circle, with the center acting as a “hot core” and cool air being drawn in from the rim.

Are You Ready for the Onslaught? (or Scaling Your Environment)

Is your environment ready for the onslaught that you may or may not be aware is coming your way?

One commonly known example of this is what is called “the Slashdot effect.” This is what happens when the popular site Slashdot (or others like it) links to a small site. The combined effect of thousands of people attempting to view the site all at once can bring it to its knees – or fill up the traffic quota in a hurry.

Other situations may be the introduction of a popular product (the introductions of the iPhone and of Halo 3 come to mind), or a popular conference (such as EAA’s AirVenture, which had some overloading problems).

Examine what happens each time a request is made. Does it result in multiple database queries? If there are x requests and each results in y queries, there will be x*y database queries – so the query load grows y times as fast as the request load.

Or let’s say each request results in a login that is held for 5 minutes. If you get x requests per second, then over those 5 minutes (300 seconds) you will have 300x open connections if none drop. Do you have the buffers and resources for this?
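
A quick back-of-the-envelope calculation along these lines is worth doing before the spike arrives. All of the input numbers below are made up for illustration:

# Back-of-the-envelope capacity estimates; every input here is a made-up example.
requests_per_second = 500           # x: peak request rate during the spike
queries_per_request = 4             # y: database queries issued per request
session_lifetime_s = 5 * 60         # each login is held for 5 minutes

db_queries_per_second = requests_per_second * queries_per_request   # x * y = 2000
open_sessions = requests_per_second * session_lifetime_s            # 300x = 150,000 if none drop

print("database queries per second:", db_queries_per_second)
print("concurrent sessions to hold:", open_sessions)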

Check your kernel tunables, and run real-world tests to see. Examine every aspect of the system to determine what resources it will take. Check TCP buffers for network connections, the number of TTYs allowed, and anything else you can think of. Go end to end, from client to server to back-end software and back.
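
On Linux, for example, a few of the relevant limits can be read straight out of /proc. This is only a starting point; which tunables matter depends on your workload:

# Dump a few Linux kernel limits that matter under heavy load.
# These are the standard /proc/sys paths on Linux; other systems differ.
TUNABLES = [
    "/proc/sys/fs/file-max",          # system-wide open file descriptor limit
    "/proc/sys/net/core/somaxconn",   # maximum TCP listen backlog
    "/proc/sys/net/ipv4/tcp_rmem",    # TCP receive buffer min/default/max (bytes)
    "/proc/sys/net/ipv4/tcp_wmem",    # TCP send buffer min/default/max (bytes)
]

for path in TUNABLES:
    try:
        with open(path) as f:
            print(path, "=", f.read().strip())
    except FileNotFoundError:
        print(path, "= (not present on this kernel)")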

Some of the options for alleviating the pressure are caching proxies, clustering, rewriting software, and tuning buffers, among others.

James Hamilton has collected a wide range of articles about how the big players have handled such scaling problems (although focused on database response), including names such as Flickr, Twitter, Amazon, Technorati, Second Life, and others. Do go check his article out!

Playstation 3 Compute Clusters

Have you heard about the Playstation 3 computing clusters that are starting to pop up? This is no game: it’s the real thing. Apparently the IBM Cell microprocessor (based on the Power architecture) is so powerful that it is leaps and bounds above other desktop systems.

The Folding@Home protein-folding project (one I very much appreciate) uses idle computers all over the world to compute protein folding, which aids scientific research toward cures for Alzheimer’s, diabetes, and other diseases. The project released a Folding@Home client for the Playstation 3, and those nodes now surpass all other computing nodes combined in sheer processing power.

On March 8, North Carolina State University announced that professor Frank Mueller had created the first academic Playstation 3 cluster (8 nodes). At the University of Massachusetts Dartmouth, assistant professor Dr. Gaurav Khanna is running a cluster of eight Playstation 3s to analyze gravitational waves from the stars. In Barcelona, Spain, a distributed computing project for biomedical research known as the PS3GRID uses the Sony Playstation 3 exclusively.

Terra Soft (the people behind Yellow Dog Linux, YUM, and the Briq) are now offering preconfigured Playstation 3 clusters in 6-node or 32-node configurations. A single Playstation 3 with Yellow Dog Linux pre-installed is also available.

A Playstation 3 cluster built by Terra Soft was the cover story of the August 1, 2007, Linux Journal.

As might be surmised, Linux runs fine on the Playstation 3: Ravi has a fine summary of the possibilities.

ServiceGuard Clusters and Differing Versions

HP ServiceGuard requires that all nodes in the cluster run the same version of the software. The cmcheckconf program will return an error something like this:

Checking nodes ... Done
Checking existing configuration ... Done
This node is at revision A.11.17.01 of Serviceguard, node cluster2 is at A.11.18.00.
Unable to make configuration changes when a node in the cluster is at a different revision.

It is apparently possible (but not recommended) to run different versions of HP-UX on different nodes. However, this requires that the same version of the JFS filesystem (vxfs) be installed everywhere, along with any application libraries that are needed – and the JFS upgrade may require a new licensing fee.

The recommended path is to run the same version of HP-UX, and the same version of ServiceGuard, on every system in the cluster. The application itself should also be the same version, but if properly written it may allow “rolling upgrades” – that is, one system temporarily runs a different version than the next. A rolling upgrade lets you keep the system up and responding even during the upgrade.
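
A simple pre-flight check across the nodes can catch a version mismatch before cmcheckconf does. The sketch below is illustrative only: it assumes passwordless ssh to each node and that the output of HP-UX’s swlist -l product contains a Serviceguard entry whose second field is the revision (the product tag and output format vary by release).

# Sketch: confirm every node reports the same Serviceguard revision before
# running cmcheckconf/cmapplyconf. Assumes passwordless ssh and that
# "swlist -l product" lists a Serviceguard entry as "<name> <revision> ...";
# the product tag and output format vary by release, so treat this as a sketch.
import subprocess

NODES = ["cluster1", "cluster2"]      # node names are hypothetical

def serviceguard_revision(node):
    output = subprocess.run(
        ["ssh", node, "swlist", "-l", "product"],
        capture_output=True, text=True, check=True).stdout
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and "serviceguard" in fields[0].lower():
            return fields[1]          # e.g. A.11.18.00
    return "unknown"

revisions = {node: serviceguard_revision(node) for node in NODES}
if len(set(revisions.values())) > 1:
    print("Revision mismatch:", revisions)
else:
    print("All nodes at revision", revisions[NODES[0]])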

In summary:

  • Different versions of HP-UX: not recommended (difficult and costly to get right)
  • Different versions of ServiceGuard: no
  • Different versions of application: yes (if written correctly)