A Single Character Causes Downtime for… WordPress.com!

Last Thursday, an error in the WordPress.com software overwrote some user settings, and some customers lost their settings as a result. The site was taken down for checks, and an hour later 99% of users were back online.

The cause? A single-character coding error. Checks and balances are certainly needed, but according to Matt Mullenweg, founder of WordPress.com, reviews and testing were already in use.

Less than a month ago, Toni Schneider, CEO of Automattic, wrote in glowing terms about the use of “continuous deployment” at WordPress.com. Is this event going to lead to the death of “continuous deployment” at WordPress.com? I suspect not.

In fact, Paul Graham described in a paper how he used Lisp for Viaweb in just this fashion. Viaweb, which was bought by Yahoo! and became the Yahoo Store, was rolling out fully implemented features this way before the practice had even become mainstream.

Let this WordPress.com downtime be a lesson in what a single character can do, and a reminder that none of us is immune to such mistakes.

Software Bugs: Good or Bad?

Recently, Karl Fogel wrote about bugs and “technical debt” in response to a mailing list thread about the future of Subversion in 2010. This resonates with me, as I recently found myself struggling with bugs in Ubuntu, only to find that they would not be fixed (in my case, it was the lack of embedded ROM code for a USB-serial adapter – code normally included with the Linux kernel).

Karl’s article was then reported on by Joe Brockmeier of OStatic.

Much of this reporting makes me think of TeX, the typesetting system created by Donald Knuth, one of computing’s “founding fathers” (so to speak, despite his coming on the scene later on). TeX has remained unchanged except for bug fixes for several decades, and it shows no sign of slowing down or dying, contrary to what the articles suggest.

I also think of the regular Ubuntu releases contrasted with the Ubuntu LTS (“Long Term Support”) releases – a contrast that mirrors the difference between Fedora and Red Hat Enterprise Linux. It is possible to have a system without bugs – or at least, with few bugs. Fixing bugs rapidly and consistently will improve the user experience and build up the goodwill attached to your name, instead of leaving you known for bugs that aren’t fixed.

A complaint often heard from users of commercial products is that the “fix” for their problem is to “upgrade to the new version” (normally at a substantial cost). This should not be the way things are done.

Bug removal should come first, with solid, reliable program operation as the number one goal. For me as a user – and as an enterprise user – reliability is primary. A product whose history shows that bug fixes come second to upgrades and new features will not last long. This is the very reason that products like Ubuntu LTS and Red Hat Enterprise Linux exist.

I agree with several premises in Joe’s article on OStatic – that bug removal should not be the only focus, and that an increase in bug reports is not all bad. More bug reports mean that more users are using the product, and they provide a way to make the software more reliable. Users would much rather apply patches and update to a more reliable version than upgrade to something entirely new that brings newly introduced bugs not yet fixed.

Using Open Source in the Enterprise: Two Stories

This was interesting. Just recently, MIT announced that it would be replacing its Cyrus IMAP infrastructure with Microsoft Exchange. The reason given was that the IS Department wanted to offer Exchange services to its “customers” (students and faculty). Isn’t it ironic that it is none other than Carnegie Mellon, another educational institution, that maintains Cyrus IMAP? Many students are also upset, as they will no longer be able to use Pine for their email.

This news can be compared to the recent news from the London Stock Exchange: they are dropping their Windows-based trading system for one based on Linux. Of course, they didn’t go out of their way to choose one or the other. However, the Windows-based system halted trading for an entire day. The exchange never stated exactly what the cause was, but reports indicated that the trading system was at fault. Now the CEO who brought in the trading system is out without any comment, and the first order of business for the new CEO is to dump the old Windows-based trading system. ComputerWorld has a nice article on it. This story speaks to the reliability of Linux overall and suggests that reliability should be a strong selling point for it.

Next time management suggests replacing Linux with Windows, tell them the story of the London Stock Exchange. The exchange is not the only one, either; go read the article.