Virginia Experiences a Severe IT Outage

The Virginia Information Technologies Agency (VITA) has been dealing with an outage that occurred while storage technicians were scanning for failed components. The outage “only” affects 228 of 3,600 servers, but those servers support 24 different Virginia departments, including the Virginia Dept. of Motor Vehicles (DMV), the Governor’s Office, the Dept. of Taxation, the Dept. of Alcoholic Beverage Control, and even VITA itself.

No word on how many Virginians are affected, but certainly it will be in the thousands.

According to the Associated Press, the crash happened when a memory card failed and the fall-back hardware failed to operate successfully. From the sound of it, an entire storage array was affected by this – how else to account for the 228 servers affected?

This suggests that the storage array is a single point of failure, and that neither the memory card nor its fall-back was tested. There should be some way of testing the hardware, or of having a clustered storage backup.

One of the biggest problems in many IT environments – including states – is budget: having full redundancy for all subsystems is expensive. States are not known for budgets that fill all departmental needs (despite large budgets, departments scrounge most of the time…). Many other data center owners consistently have tight budgets: libraries, non-profits, trade associations, etc.

The usual response to tight budgets is to consider the likelihood of failure in a particular component, and to skip redundancy for components that are unlikely to fail. A better way to look at it would be to compare not the likelihood of failure, but the cost of failure. The failure in Virginia’s storage was supposed to be extremely unlikely, but has had a tremendous cost both to Virginia and to Northrop Grumman.
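To make the comparison concrete, here is a small sketch in Python. Every figure in it is invented for illustration: the component names, failure probabilities, and dollar amounts are assumptions, not data from the Virginia outage. The point is only that ranking by cost of failure can give a very different priority list than ranking by likelihood.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    annual_failure_probability: float  # hypothetical value
    cost_of_failure: float             # hypothetical dollars per incident

    @property
    def expected_annual_loss(self) -> float:
        # Likelihood weighted by cost: what a failure "costs" per year on average.
        return self.annual_failure_probability * self.cost_of_failure


# Made-up inventory: the least likely failure is also the most expensive one.
components = [
    Component("desktop disk", 0.05, 2_000),
    Component("edge switch", 0.02, 50_000),
    Component("shared storage array", 0.001, 10_000_000),
]

# Ranking by likelihood alone puts the storage array last ...
by_likelihood = sorted(
    components, key=lambda c: c.annual_failure_probability, reverse=True
)
# ... while ranking by cost of failure puts it first.
by_cost = sorted(components, key=lambda c: c.cost_of_failure, reverse=True)

for c in by_cost:
    print(f"{c.name}: expected annual loss ${c.expected_annual_loss:,.0f}")
```

With these made-up numbers the storage array is the least likely component to fail, yet it carries the largest expected loss, which is the argument for giving it redundancy first.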

Virginia’s IT was outsourced in part to Northrop Grumman, and was supposed to be a model of privatizing state IT in the nation. However, Virginia’s experience shows how privatizing can fail, and how outsourcing companies do not necessarily provide the service that is desired.

The Washington Post reported on this failure, as well as others in the past, such as when the network failed and there was no backup in place. Both Virginia and Northrop Grumman should have noticed these gaps and fixed them before the backups were needed.

The Richmond Times Dispatch has an article on this outage, and will be updating over the weekend.

Software to Keep Servers Running During Cooling Failures

Purdue has created software for Linux that will slow down processors during a cooling failure in a data center.

While a processor runs, it generates heat. The slower it runs, the less heat it generates. Thus, when the air cooling system in a data center fails, the less heat the better. When thousands of servers are clocked downwards, the heat savings will be tremendous.

With the software from Purdue, a server will slow way down in order to generate the least amount of heat possible. With this change, servers can actually be kept running longer and thus could potentially avoid downtime entirely.

At Purdue’s supercomputing center where this was developed, they’ve already survived several cooling failures without downtime.

Purdue’s situation, however, does appear to have some unique qualities. One is that the software was designed for their clusters, which number in the thousands of CPUs – meaning that a slow-down can be activated across several thousand servers simultaneously. This has a tremendous effect on the cooling in the data center, and it is also easy to do since all the servers are identical.

With that many servers, the cluster can dominate the server room as well. In a heterogeneous environment like most corporate server rooms, software like this would have to be on all platforms to be effective.

The places where slowdown software could be most effective are large clustered environments, as well as small or homogeneous environments. Slowdowns could be triggered by many things: cooling failures, human intervention, or even heating up of the server itself.
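As a rough illustration of the last trigger (the server itself heating up), here is a minimal Python sketch built on the Linux cpufreq and thermal sysfs files. It is not Purdue’s software; the function name and the temperature limit are assumptions, and the sysfs root is passed in as a parameter so the logic can be tried against a scratch directory before it is pointed at a real /sys.

```python
from pathlib import Path


def read_int(path: Path) -> int:
    """Read a sysfs-style file that holds a single integer."""
    return int(path.read_text().strip())


def throttle_if_hot(sysfs_root: Path, limit_millicelsius: int) -> list[str]:
    """Cap each CPU at its minimum frequency if any thermal zone is too hot.

    Returns the names of the CPUs that were capped (empty if nothing was done).
    """
    # Thermal zones report their temperature in thousandths of a degree Celsius.
    zones = sysfs_root.glob("class/thermal/thermal_zone*/temp")
    hottest = max((read_int(zone) for zone in zones), default=0)
    if hottest < limit_millicelsius:
        return []

    capped = []
    cpu_dirs = sysfs_root.glob("devices/system/cpu/cpu[0-9]*/cpufreq")
    for cpufreq in sorted(cpu_dirs):
        # cpuinfo_min_freq is the lowest frequency the hardware supports;
        # writing it to scaling_max_freq forbids anything faster.
        lowest = (cpufreq / "cpuinfo_min_freq").read_text().strip()
        (cpufreq / "scaling_max_freq").write_text(lowest)
        capped.append(cpufreq.parent.name)
    return capped
```

On a real machine this would be called as throttle_if_hot(Path("/sys"), 70_000) with root privileges; the 70 degree limit is an arbitrary example, not a recommendation. Restoring full speed would mean writing cpuinfo_max_freq back into scaling_max_freq once the room has cooled.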

Novell and VMware Team Up

VMware announced in June that Novell’s SUSE Linux Enterprise Server (SLES) will be shipped with every copy of VMware’s vSphere product. In addition, VMware sales staff will have incentives to sell SLES. During the recent sales call by Novell, they expanded on the details of the enhanced partnership.

According to VMware’s page for SLES on VMware, it also sounds as if current vSphere customers would be eligible for a supported copy of SLES as well.

This is incredible news – it means that SUSE may be able to gain some traction in the data center. I’ve been partial to SUSE in some ways ever since I found that XFS (and JFS!) had been supported in SUSE Linux for years before Red Hat did – SUSE has always supported technologies first, providing more value than Red Hat did.

I also supported SUSE Linux in the data center in the past; it has been rock solid (as is Red Hat). SUSE Linux has a lot to offer – as does OpenSUSE (which just recently introduced 11.3).

Red Hat has always done well – as it should – but SUSE has been in the shadows for too long.

It has also been noted that VMware could be a company that buys SUSE and Novell’s Linux business. VMware was bought by EMC not that long ago. Cisco also has a joint venture with EMC that includes VMware products. Is it possible that Cisco will be shipping products with SLES on them?

OpenSolaris is Officially Dead

We saw this coming.

As of 23 August 2010, the OpenSolaris Governing Board (OGB) has stepped down; Ben Rockwood posted the resolution on his blog.

Oracle’s previous email to staff shows that Oracle has no interest in keeping OpenSolaris going, and now there is no one minding the store.

The next step lies with Illumos, the new torch-bearer for open source Solaris. Nexenta, the commercial UNIX based on OpenSolaris and a GNU userland, will probably be the first to use Illumos (the project has close ties to Nexenta) – and Belenix may be next, although Belenix development seems to be quite slow (there is no corporate sponsorship and the community seems to be small). Belenix has the tougher problem, as they use a Solaris-based userland.

A New Init for Fedora 14

Apparently, a new project (to replace init, inetd, and cron) named systemd is nearing release and will be used to replace upstart in Fedora 14 (to be released in November – with Alpha Release due today!).

There is a healthy crop of init replacements out there, and the field is still shaking out. Replacing init – or specifically, System V init and init scripts – seems to be one of those never-ending projects: everyone has an idea on how to do it, no one can agree on how.

Let’s recap the current crop (excluding BSD rc scripts and System V init):

I am still waiting for the shakeout – it bugs me that there are dozens of different ways to start a system, and that none of them have taken over as the leader. For years, BSD rc scripts and System V init have been the standard – and both have stood the test of time.

My personal bias is towards SMF (O’Reilly had a nice article on it) and towards simpleinit – but neither has expanded like upstart has.

So where’s the replacement? Which is The One? It appears that no one is willing to work within a promising project, but rather starts over and creates yet another replacement for init, fragmenting the market further.

Lastly, if the current init scheme is so bad, why hasn’t anything taken over? Commercial UNIX environments continue to use the System V scheme, with the sole exception of Solaris, which made the break to the Service Management Facility (or SMF). Why doesn’t HP-UX or AIX use SMF or Upstart if the current environment is horrible?

Sigh. It’s not that the current choices of replacement are bad – it’s just that there are so many – and more keep coming up every day. Perhaps we can learn something about the causes of this fragmentation from a quote from a paper written about the NetBSD rc.d startup scripts and their design:

The change [in init] has been one of the most contentious in the history of the [NetBSD] project.

Canonical Kills Ubuntu Maverick Meerkat (10.10) for Itanium (and Sparc)

It wasn’t long ago that Red Hat and Microsoft released statements that they would no longer support Itanium (with Red Hat Enterprise Linux and Windows respectively). Now Canonical has announced that Ubuntu 10.04 LTS (Long Term Support) will be the last supported Ubuntu on not only Itanium, but Sparc as well.

Itanium has thus lost three major operating systems (Red Hat Enterprise Linux, Windows, and Ubuntu Linux) over the past year. For HP Itanium owners, this means that Integrity Virtual Machines (IVMs) running Red Hat Linux or Microsoft Windows Server will no longer have support from HP (since the operating system designer has ceased support).

The only bright spot for HP’s IVM is OpenVMS 8.4, which is supported under an IVM for the first time. However, response to OpenVMS 8.4 has been mixed.

Martin Hingley has an interesting article about how the dropping of RHEL and Windows Server from Itanium will not affect HP; I disagree. For HP’s virtual infrastructure – based on the IVM product – the two biggest environments besides HP-UX are no longer available. An interesting survey would be to find out how many IVMs are being used and what operating systems they are running now and in the future.

With the loss of Red Hat and Microsoft – and now Canonical’s Ubuntu – this provides just that many fewer options for IVMs – and thus, fewer reasons to use an HP IVM. OpenVMS could pick up the slack, as many shops may be looking for a way to take OpenVMS off the bare metal, letting the hardware be used for other things.

If HP IVMs are used less and less, this could affect the Superdome line as well, as running Linux has always been a selling point for this product. As mentioned before, this may be offset by OpenVMS installations.

This also means that Novell’s SUSE Linux Enterprise Server becomes the only supported mainstream Linux environment on Itanium – on the Itanium 9100 processor at least.

From the other side, HP’s support for Linux seems to be waning: this statement can be found in the fine print on their Linux on Integrity page:

HP is not planning to certify or support any Linux distribution on the new Integrity servers based on the Intel Itanium processor 9300 series.

Even if HP doesn’t feel the effect of these defections, HP’s IVM product family (and Superdome) probably will.

Oracle Sues Google Over Java on Android

Oracle – now having purchased Sun – has sued Google over their custom Java virtual machine for the Android mobile platform. In doing so, Oracle has sent reverberations throughout the open source and Java communities.

Google took the Java APIs and enhanced and changed them – then created a virtual machine (called Dalvik) which runs a custom format executable. This was part of the Android software when it was introduced in November 2007, and there were many complaints about Google’s treatment of Java – including complaints from Sun itself. Google’s response at the time to Sun’s complaints was:

Google and the other members of the Open Handset Alliance are working to help solve fragmentation and supporting the developer community by creating Android, a mobile platform that responds to the needs of the developers, has the backing of industry leaders, and will be available as open source under a nonrestrictive license.

To break that statement down, Google was saying:

  • The Open Handset Alliance (not the Java Community Process or JCP) should be the Java stewards for mobile Java.
  • Android (and Android Java) responds to the needs of the developers.
  • Android is backed by industry.
  • Android is available as open source.
  • Android is available under a nonrestrictive license.
  • Java 2 Mobile Edition (J2ME) has none of these capabilities.

Don’t miss the fact that Google created the Open Handset Alliance at the same time, and that the Alliance serves mainly as a source for Android – though it has in recent days been seen as useless by some.

Sun (now Oracle) has had a mobile version of Java (known as J2ME) since before Android existed – but Google bypassed it (and the Java Community Process or JCP) when it created its own JVM. Dalvik executables, in fact, are created from Java binaries, thus involving Java itself in the process of creation and development.

It appears that Google’s Android Java implementation was a direct attack on the JCP and on J2ME. To use J2ME, Google would have had to license it, as it was not available under a license that would have allowed commercial closed-source development: it was under the GPL, but without the classpath exemption that the J2SE had. Because of this lack of the classpath exemption, any development on the standard J2ME platform would have to be released as source code under the GPL.

This action by Oracle fits perfectly into its public persona: consider that Sun’s Chief Open-Source Officer, Simon Phipps, was not even offered a position at Oracle. He is or was on the advisory boards for OpenSolaris, OpenJDK, and OpenSparc. Other distinguished Sun engineers have left, including Kohsuke Kawaguchi (chief developer of Hudson), Charles Nutter and Thomas Enebo (both lead developers of JRuby), Tim Bray (Director of Web Technologies – which includes Java and JRuby), and James Gosling (creator of Java). It is notable that all of these people except Simon Phipps are luminaries in the Java realm at Sun. It is as if the Java engineers left wholesale once Oracle was about to take over.

Coverage of the lawsuit has been extensive. Stephen Shankland over at CNet has a story about why Oracle may have chosen to sue. Stephen O’Grady over at RedMonk may have one of the best in-depth analyses of this conflict out there. Groklaw has committed to following the lawsuit through the courts, and has an excellent introductory piece on the lawsuit. Steven Vaughan-Nichols suggests that this lawsuit is only the beginning, and that JBoss, Apache Jakarta, and the JCP had better watch out (though I disagree).

Back when Google introduced Android and its associated virtual machine, Dalvik, Stefano Mazzocchi had one of the most complete explanations of what Google was doing and its implications.

Ubuntu: dpkg fails with “failed in buffer_read(fd)”

Recently, I was trying to update (and upgrade) Ubuntu Lucid, and received this error while running apt-get:

dpkg: unrecoverable fatal error, aborting:
failed in buffer_read(fd): files list for package `apparmor': Input/output error
E: Sub-process /usr/bin/dpkg returned an error code (2)

The solution was summarized nicely by Vivek Kapoor; he attributes the solution to C.M. Connelly (from 5 May, 2003). One of the nice things about Connelly’s entry is that he shows you how he debugged the problem he had, and how he fixed it; go read the post.

The error message is coming from dpkg, and refers to the “files list for package `apparmor'”. The files list is kept in /var/lib/dpkg/info; in this case, /var/lib/dpkg/info/apparmor.list. The error message is saying that, for some reason, this file cannot be read.
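If one list file is unreadable, others may be too. The following Python sketch tries to read every .list file in a directory and reports the ones that fail. The function name is made up for illustration, and it assumes the failure shows up to Python as an OSError, which is how an Input/output error is normally raised.

```python
from pathlib import Path


def unreadable_list_files(info_dir: Path) -> list[str]:
    """Return the package names whose .list file in info_dir cannot be read."""
    broken = []
    for list_file in sorted(info_dir.glob("*.list")):
        try:
            # Reading the whole file forces every block to come off the disk.
            list_file.read_bytes()
        except OSError:
            broken.append(list_file.stem)
    return broken
```

Calling unreadable_list_files(Path("/var/lib/dpkg/info")) would, under those assumptions, return something like ["apparmor"] on the system described here.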

This file can be recreated if you have the package on hand; if not, you can fetch it with apt-get install -d package (adding the --reinstall option if necessary). The package will be downloaded to /var/cache/apt/archives. Note that if a reinstall is attempted at this point, the download through apt will succeed but the install step (through dpkg) will still fail.

The listing that dpkg -c prints for the package looks like this (using the top five lines of apparmor as an example):

drwxr-xr-x root/root         0 2010-03-30 14:59 ./
drwxr-xr-x root/root         0 2010-03-30 14:59 ./sbin/
-rwxr-xr-x root/root    783108 2010-03-30 14:59 ./sbin/apparmor_parser
drwxr-xr-x root/root         0 2010-03-30 14:59 ./etc/
drwxr-xr-x root/root         0 2010-03-30 14:59 ./etc/apparmor/

To recreate the file, pipe the output from dpkg -c debfile – like this:

dgd@cor:/var/cache/apt/archives$ dpkg -c apparmor_2.5-0ubuntu3_i386.deb |sudo tee /var/lib/dpkg/info/apparmor.list >/dev/null

After that, you should be good to go. You might want to check the disk (using fsck) and perhaps reinstall the package to make sure all files are okay.
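One caveat: the .list files that dpkg writes for itself hold one absolute path per line (/., /sbin, /sbin/apparmor_parser, and so on) rather than the full listing printed by dpkg -c. Reinstalling the package rewrites the file in that form. For anyone who would rather write that form directly, here is a hedged Python sketch of the conversion. The function name is invented, and it assumes the six-column layout shown earlier; hard-link entries are not handled.

```python
def dpkg_listing_to_paths(listing: str) -> list[str]:
    """Convert `dpkg -c` output into the one-path-per-line form of a .list file."""
    paths = []
    for line in listing.splitlines():
        # Columns: mode, owner/group, size, date, time, then the path.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        path = fields[5].split(" -> ")[0]  # drop a symlink target, if any
        path = path.rstrip("/")            # directories end with a slash
        if path.startswith("."):
            path = path[1:]                # "./sbin" becomes "/sbin"
        paths.append(path or "/.")         # the root entry "./" becomes "/."
    return paths


sample = """\
drwxr-xr-x root/root         0 2010-03-30 14:59 ./
drwxr-xr-x root/root         0 2010-03-30 14:59 ./sbin/
-rwxr-xr-x root/root    783108 2010-03-30 14:59 ./sbin/apparmor_parser
"""
print("\n".join(dpkg_listing_to_paths(sample)))
```

The sample input is the top of the apparmor listing from this post; the printed lines are what would go into apparmor.list.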

I don’t understand how this dpkg problem has lasted for seven years now; dpkg should be able to cleanly handle the recreation of this file if necessary, and shouldn’t be reporting obscure messages about its internal workings. From a user perspective – and a system administrator perspective – dpkg should automatically recreate the list file if there are problems with it, or even recreate all control files used by the package.

One very interesting tip was hidden in Connelly’s blog post from 2003: you can use less on a Debian package and it will report useful information (here’s an example from first lines of apparmor):

apparmor_2.5-0ubuntu3_i386.deb:
 new debian package, version 2.0.
 size 350314 bytes: control archive= 3944 bytes.
    2338 bytes,    61 lines      conffiles
     360 bytes,    18 lines   *  config               #!/bin/sh
     662 bytes,    15 lines      control
     708 bytes,    10 lines      md5sums
    3577 bytes,   119 lines   *  postinst             #!/bin/sh
    2402 bytes,    90 lines   *  postrm               #!/bin/sh
    1186 bytes,    52 lines   *  preinst              #!/bin/sh
     959 bytes,    32 lines   *  prerm                #!/bin/sh
     421 bytes,     9 lines      templates
 Package: apparmor
 Version: 2.5-0ubuntu3
 Architecture: i386
 Maintainer: Ubuntu Core Developers
 Installed-Size: 2248
 Depends: libc6 (>= 2.8), debconf (>= 0.5) | debconf-2.0, lsb-base, initramfs-tools, debconf
 Suggests: apparmor-profiles, apparmor-docs
 Conflicts: libapache2-mod-apparmor (<< 2.5-0ubuntu2)
 Replaces: apparmor-parser, libapache2-mod-apparmor (<< 2.5-0ubuntu2)
 Section: admin
 Priority: extra
 Homepage: http://apparmor.wiki.kernel.org/
 Description: User-space parser utility for AppArmor
  AppArmor Parser is a user level programs that is used to load in program
  profiles to the AppArmor Security kernel module.

*** Contents:
drwxr-xr-x root/root         0 2010-03-30 14:59 ./
drwxr-xr-x root/root         0 2010-03-30 14:59 ./sbin/

The Death of OpenSolaris Confirmed

Recently, I posted about the future of OpenSolaris and the lack of response from Oracle.

Oracle still has no official response, and has no word on where OpenSolaris is going. However, a memo to Oracle Engineering was leaked and then posted to the OpenSolaris Discussion mailing list (osol-discuss) and was later confirmed by an Oracle employee to the mailing list.

William Yang has a nice write-up on the memo and its salient points; in short:

  • Oracle will no longer let OpenSolaris track Solaris development.
  • Solaris code will stay under the CDDL license.
  • “OpenSolaris” as a distribution will no longer be released.
  • Code will only be released after Solaris is released.

Also interesting are Oracle’s reasons for closing down OpenSolaris:

  • Not enough man-power.
  • Releases Solaris technology to competitors.
  • Prevents users from using Solaris.

Oracle has never been a popular company; most Oracle DBAs in my experience have never been happy with Oracle’s support or licensing, for example. This contrasts with Sun, which has always had a positive image.

In the area of open source, Oracle has always been a champion of closed source, in contrast with Sun which had been a positive open source champion. As a result of this, we are seeing more and more open source projects by Sun either closed down or changed into closed source: consider the closing of Project Kenai (a SourceForge-like site for open source projects), the fears over the future of MySQL, and the death of OpenSolaris.

The OpenSolaris experience under Oracle has echoes in MySQL: Monty Widenius, the founder of MySQL, was quite vocal in his opposition to the Oracle purchase of Sun, and expressed his fear that MySQL would become closed source. Perhaps his experience with SAP and MaxDB had something to do with that – MaxDB had been released under the GPL through 7.6, when it was returned to SAP and became closed source once again.

About the time that Oracle announced its purchase of Sun, Monty began MariaDB, the GPL-licensed version of MySQL, which has taken hold; the European Union also mandated that MySQL shall remain dual-licensed. I wonder if MySQL’s fate would have been similar to OpenSolaris if it had not been for Monty.

It would be interesting to track the other open source projects now under Oracle’s umbrella:

  • Java (and OpenJDK), and its add-ons
  • Glassfish (J2EE)
  • MySQL
  • NetBeans
  • Lustre file system

Oracle’s Plans for OpenSolaris Murkier than Ever

The controversy around the future of OpenSolaris has been building to a fever pitch these last few weeks, most recently leading to the creation of Illumos, a new open source kernel tree based on the open source portions of OpenSolaris.

Way back in July of 2009, Steven Vaughan-Nichols suggested that OpenSolaris would wither on the vine through deliberate neglect by Oracle – and this seems to be happening (whereas his prediction of the same treatment for MySQL and VirtualBox seems to be misplaced). Then in February of 2010, Ben Rockwood wrote an open letter to Oracle about the future of Solaris and OpenSolaris.

Oracle’s most recent response (during an interview with ServerWatch) has been to state that development on Solaris continues apace, and that Solaris 11 is due out by the end of 2011. Most notable was the lack of any discussion on the future of OpenSolaris.

A few months ago, the OpenSolaris Governing Board – in effect, the people in charge of the details of operating the OpenSolaris community and its resources – announced that its members were willing to resign en masse if Oracle did not talk to them; Peter Tribble (a member of the OpenSolaris Governing Board) talks about this action in his blog.

I agree with those that say that Oracle can do what it likes, and the threat made by the board is empty – not because of the threat itself, but because it will accomplish nothing, and has no effect on Oracle. If Oracle wants OpenSolaris to go away, it doesn’t matter what the OpenSolaris community thinks. The Governing Board simply has no leverage with Oracle.

No word on how this action will affect Belenix; while Nexenta is basically the OpenSolaris kernel plus a Debian/GNU userland, Belenix is an OpenSolaris kernel plus a mostly Solaris userland. The primary founder of Belenix (Moinak Ghosh) is on the OpenSolaris board; one of the other developers (Sriram Narayanan?) blogged about the board’s action shortly after it was taken in July. Perhaps Belenix would use the Illumos kernel as well?

However, the prospect of OpenSolaris living on in the form of Illumos is promising, and technologies that are part of the open source OpenSolaris will not be lost. Nexenta has already stated its interest in Illumos; this is perhaps because Nexenta relies on OpenSolaris (with its now doubtful future) for its kernel. Thus, it is perhaps no surprise that a Nexenta engineer is the driving force behind Illumos, and neither is it a surprise that Illumos is currently a kernel only.

So now – how long before we see a Debian/Illumos project? Or is that Nexenta now?
