Our hosted CRM system, Really Simple Systems, has been running at 99.99% uptime for the last three years, and try as hard as we have, we just haven’t been able to achieve the magic ‘five nines’: 99.999%. In plain English, 99.99% uptime means that the system can be unavailable for a maximum of just under one hour a year (about 53 minutes); 99.999% means down for only about 5 minutes a year. So Google’s outage of nearly two hours yesterday, plus previous outages in the past few months, is embarrassing for Google, will make business users wonder if GMail is suitable for business, and is generally bad PR for Cloud Computing.
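The downtime figures above are simple arithmetic on the number of minutes in a year. A minimal sketch, assuming a 365-day year (the function name is just for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime (in minutes) allowed by an uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


for target in (99.9, 99.99, 99.999):
    minutes = max_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.1f} minutes of downtime a year")
```

By this measure a single two-hour outage uses up more than two years' worth of a 99.99% allowance.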
If I look back at the last three years to see what caused our outages, it has never been ‘our’ fault; that is, our servers and software didn’t go down, the datacentre did. The reasons so far have been: a power outage at the datacentre (when the standby power supplies also failed); somebody unplugging the main router by mistake; and a bug in the router software that caused the other routers to crash. But customers don’t care whose fault it is; they just want to be able to access the system all day, every day.
We’ve had a failover system operating in another datacentre for some years, but it is difficult to switch all the customers to it: the data hasn’t been 100% up to date, and we have to manually tell customers the URL so they can log on. So unless an outage can’t be fixed within 30 minutes (and so far they all have been), it hasn’t been worth switching.
After six months spent looking at the various options to insulate us from datacentre failures, we found only one that looked practical: build a complete duplicate failover system in another datacentre; replicate the data there in real time; and be able to switch from the main datacentre to the failover datacentre instantly.
It’s expensive to build a redundant datacentre, but as we get bigger so does the inconvenience that an outage causes our customers.
So that’s what we’ve done. The main servers will be in a datacentre in Manchester, the failover servers and DNS hosting in Maidenhead (200 miles south). The data is replicated in real time (using MySQL replication). If the main datacentre goes down we can switch the DNS to the failover datacentre (almost) instantly; if the failover datacentre goes down, the DNS will stay pointing to the main datacentre anyway.
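The switching rule described above can be sketched in a few lines. This is only an illustration, not our actual tooling: the class name, the site labels and the threshold of three failed checks are all assumptions made for the example, and the health check itself and the DNS update are left out.

```python
from dataclasses import dataclass

# Hypothetical labels for the two sites; real DNS records would hold IP addresses.
MAIN = "manchester"
FAILOVER = "maidenhead"


@dataclass
class FailoverPolicy:
    """Decide where DNS should point, based on health checks of the main site."""

    failures_before_switch: int = 3  # assumed threshold, to ride out brief blips
    consecutive_failures: int = 0
    active: str = MAIN

    def record_check(self, main_is_healthy: bool) -> str:
        """Record one health check of the main site and return the DNS target."""
        if main_is_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_before_switch:
                self.active = FAILOVER
        # The state of the failover site is never consulted, so if it goes
        # down while the main site is up, DNS simply stays on the main site.
        return self.active


policy = FailoverPolicy()
for healthy in (True, False, False, False):
    target = policy.record_check(healthy)
print(target)
```

Note that in this sketch the policy never switches back by itself once it has moved to the failover site. That is a deliberate assumption of the example: returning to the main datacentre would presumably be a manual step, taken once the data there has caught up.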
If anybody can see a hole in this logic, let me know!