The Death of a Server – A Cautionary Tale

One of our hosting clients also has some services hosted… Elsewhere. And Elsewhere has a problem. The symptoms are easy to describe: The high-profile eCommerce installation that’s meant to be live and selling to the world – is dead.

Not resting, not stunned, not pining for anything. Apache: dead, MySQL: dead, even Plesk: dead. They’re all dead, Dave.

And it really really didn’t have to be this way.

So what happened?

Well, the first Layer5 heard was the customer uttering three words: “Help help help”. After a few jousts with the support operations at Elsewhere it became apparent to him that they weren’t going to help him: It was a dedicated server, and whilst they did rent him the hardware, and it was seemingly a hardware issue, any form of resurrection and Return to Live was his own problem. Absolutely Mr Customer, we have a very robust contract to hide behind.

Enter Layer5. Fact-finding first. Symptoms: MySQL service stopped and won’t start. That had taken down the Interspire (the local-install forerunner to BigCommerce) installation, and also the actual Parallels control panel. OK so the eCommerce operation is out, and the tools to investigate the outage – are out.

Time for Layer5 to enter the Elsewhere support jousting arena. We quickly established that the MySQL startup was returning lots of device errors, so it looks like we might have a hard drive playing up. OK, let’s hot swap it and see where we get to.

There is no hot swap. There is no RAID. Backups are the customer’s responsibility. We have a very robust contract to hide behind. Can we close the ticket now?

Deep Breath

We’ll come back to the implications of the lack of resilience in this installation in a little while….

My first impression of Elsewhere support is that it is staffed with reasonably competent but very very busy technicians. They can answer questions about what’s wrong and how it should be fixed. They don’t know about, nor have any focus on, the commercial outcome. They answer technical questions with technical answers and close tickets.

I need to get this eCommerce site back on its feet, says I. You need to restore your server, have a nice day, can we close the ticket now? say they.

So we have a nice fresh blank server now, with a replaced hard drive. The original hard drive is offered up in a USB caddy and connected to the server, and Layer5 proceeds with the salvage operation.

The upshot is that this is an old-fashioned bad sector on a hard drive. This stuff happens all the time. It shouldn’t ever be responsible for taking everything out, though. That’s why RAID was invented, that’s why one takes backups.

The bad sector on the disc was slap bang in the middle of the InnoDB file (tech speak for the previously beating heart of the database), so there was no way back for the MySQL DB underneath Interspire.

We managed to obtain a SQL dump from “a little while back”, and combined with the web file-system from the defunct disc we have a reasonable facsimile of the site back up.

Counting The Cost

There are missing orders (the database doesn’t contain any of the recent orders made), missing products, missing customers, missing CMS content updates. All in all a few weeks’ work, irretrievable.

The commercially squeamish should look away now… it isn’t easy reading.

  • Conversion Rate x Average Order Value x Daily Visitors = the lost revenue.
  • Add to that the fact that they shopped elsewhere.
  • Add to that the fact that they might never come back.
  • Add to that the reputation damage by having your site off air for 72 hours.
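
To put a rough number on that first bullet, here is a minimal sketch in Python. The figures are purely hypothetical (they are not the customer’s numbers), and multiplying the daily figure by the days offline is our own assumption about how to scale the formula up to a 72 hour outage. It only covers the direct loss; the other three bullets are much harder to price.

```python
def lost_revenue(conversion_rate, average_order_value, daily_visitors, days_down=1.0):
    """Direct revenue lost while the site is offline.

    Implements: conversion rate x average order value x daily visitors,
    scaled by the number of days offline (the scaling is an assumption).
    Ignores shoppers who never come back and any reputation damage.
    """
    return conversion_rate * average_order_value * daily_visitors * days_down


# Hypothetical shop: 2% conversion, 50.00 average order, 1,000 visitors a day,
# offline for 72 hours (3 days).
estimate = lost_revenue(0.02, 50.0, 1000, days_down=3)
print(f"Estimated direct loss: {estimate:,.2f}")
```

Swap in your own conversion rate, order value and traffic to see what three days off air would cost you.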


Learning Points

There are a few things here.

The customer in this case rented a dedicated server from a big corporate Elsewhere, and I don’t think it’s too big a sin to assume that it came with modern disaster recovery mechanisms built in.

But it wasn’t asked, and therefore it wasn’t told. So what the customer ended up with was a server with a couple of discs, but no RAID array (which protects against one of these discs failing in exactly the manner in this case), and criminally no backup strategy in place at all.

The Elsewhere support team prides and markets itself on technical capability. If you ask me, that’s mandatory but nowhere near enough. The Layer5 holistic approach is more about steering clear of technical minutiae (we Just Do that behind the scenes) and instead focusing on the commercial result.

Our support quest is to get your eCommerce site selling, and we don’t need you to worry about which disc is failing to mount! We’ll close our ticket when you’re back online.

And just for the record…

  • All Layer5 servers come with resilience built in.
  • We won’t put a site or a server out there without RAID.
  • We won’t put a site or server out there without a backup strategy in play from Day 1: it will (at least) be 5 rolling days, with 13 rolling monthlies.
  • …and the Return To Live won’t be 72 hours.
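
For the technically curious, a “5 rolling days, 13 rolling monthlies” retention scheme can be sketched in a few lines of Python. This is an illustration only, not the actual Layer5 tooling: the function name `backups_to_keep` is hypothetical, and treating the earliest backup in each calendar month as that month’s “monthly” is an assumption made for the example.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, daily=5, monthly=13):
    """Return the set of backup dates to retain under a rolling scheme."""
    ordered = sorted(set(backup_dates), reverse=True)  # newest first
    keep = set(ordered[:daily])  # the most recent `daily` backups

    # Assumption: the "monthly" is the earliest backup taken in a calendar month.
    first_of_month = {}
    for d in ordered:  # newest to oldest, so the last write per month is the earliest date
        first_of_month[(d.year, d.month)] = d

    # Keep the monthly backup for the most recent `monthly` calendar months.
    recent_months = sorted(first_of_month, reverse=True)[:monthly]
    keep.update(first_of_month[m] for m in recent_months)
    return keep


# Hypothetical history: one backup a night for two years.
today = date(2024, 6, 30)
history = [today - timedelta(days=n) for n in range(730)]
kept = backups_to_keep(history)
print(len(kept), "backups retained out of", len(history))
```

Anything not in the returned set is safe to prune, so the storage cost stays flat however long the server has been running.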

If you have any concerns about your own eCommerce platform, regarding any of the issues contained in this blog – get in touch – 0161 850 4545