Loggly's Outage for December 19th
Sometimes there's just no other way to say "we're down" than just admitting you screwed up and are down. We're in the process of rebuilding the indexes of historic data of our paid customers. This is our largest outage to date, and I'm not at all proud of it.
So What Happened?
Yesterday afternoon all of our machines on Amazon's East region, availability zone 1d, were rebooted by AWS staff for maintenance purposes.
The cause of our failure is what some of you on Twitter are calling "a failure to architect for the cloud". I would refine that a bit to say "a failure to architect for a bunch of guys randomly rebooting 100% of your boxes". We've been told by Amazon they actually had to work hard at rebooting a few of our instances, and one scrappy little box actually survived their reboot wrath.
While some might go on a rant about how 'normal' failures don't affect 100% of your boxes, the truth is that anything and everything (including an army of reboot monkeys) can be expected to happen to your servers if you wait around long enough. The trick to being good at running a reliable service is to architect around any number of everythings that could happen to your service and build for them.
In this case we didn't build the workaround simply because the system we run - a combination of 0MQ+Solr+Zookeeper+Loggly Special Sauce - makes it extremely challenging to survive a complete failure with more than 1/2 of the cluster missing. With other challenges facing us, we decided to live with the risk.
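To illustrate why losing more than half of a cluster is so hard to survive, here is a minimal sketch of a majority-quorum check of the kind ZooKeeper-style coordination relies on. The post does not say which component imposes the limit, so treating it as a quorum rule is an assumption, and the function name `has_quorum` is hypothetical.

```python
def has_quorum(total_nodes: int, live_nodes: int) -> bool:
    """Return True if a strict majority of the cluster is still alive.

    Assumption: the cluster needs more than half of its members to keep
    coordinating, as a ZooKeeper ensemble does. This is an illustration,
    not Loggly's actual logic.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    # Multiply instead of dividing so odd and even sizes are handled exactly.
    return 2 * live_nodes > total_nodes


if __name__ == "__main__":
    # Print whether a five-node cluster keeps quorum at each survivor count.
    for live in range(6):
        print(live, has_quorum(5, live))
```

Under this rule a reboot of every box at once leaves nothing to elect from, which is a different problem from losing one or two indexers.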
So, How Do We Make This Right?
Single instances of Loggly's search cluster can't be spread across multiple availability zones or regions due to the amount of data we push around, latencies between the search nodes, and the lack of support in our system for redundant indexes. We've been OK with those limitations in the past simply because we systematically archive data to S3 when it arrives and we are capable of rebuilding indexes on the fly if we lose one or more indexers.
Our primary method to address this will be to start sharding our customers across multiple Loggly deployments. This will prevent further outages to the entire customer base. We've already been investigating other data centers on both dedicated hardware and other cloud-based services.
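As a rough illustration of the sharding idea, the sketch below pins each customer to one deployment with a stable hash, so a failure of one deployment touches only the customers assigned to it. The deployment names, the function names, and the choice of hashing are all assumptions made for the example; the post does not describe how customers will actually be assigned.

```python
import hashlib

# Hypothetical deployment names; the post does not name any.
DEPLOYMENTS = ["deployment-a", "deployment-b", "deployment-c"]


def deployment_for(customer_id: str, deployments=DEPLOYMENTS) -> str:
    """Map a customer to exactly one deployment using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it to a list index.
    index = int.from_bytes(digest[:8], "big") % len(deployments)
    return deployments[index]


def affected_customers(customers, failed_deployment, deployments=DEPLOYMENTS):
    """List the customers who live on the deployment that went down."""
    return [c for c in customers
            if deployment_for(c, deployments) == failed_deployment]


if __name__ == "__main__":
    customers = [f"customer-{n}" for n in range(10)]
    for name in DEPLOYMENTS:
        print(name, affected_customers(customers, name))
```

A plain modulo mapping like this reshuffles most customers whenever a deployment is added, so a real system would more likely use an explicit assignment table or consistent hashing.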
Finally, we accept full responsibility for the impact to our customers. We will be in touch with our paid customers sometime over the next week to address compensation for this outage.
We welcome feedback below.
Kord Campbell, CEO
gba 19 Dec, 2011 06:21pm
We too had to suffer through the great AWS rebootpocalypse of 2011, and it wasn’t pleasant.
As a former sysadmin of 10+ years, I feel your pain. I always think of MacGyver when stuff like this happens. That is, no matter how many tools you’ve got in your toolbox at home, odds are you’ll be taken hostage in a 3rd world prison camp when sh*t goes down, with nothing more than a stick of gum and a bobby pin to rely on.
Good luck!
Blake Irvin 20 Dec, 2011 06:29pm
As a former AWS user, I’ve been pretty pleased with performance and support from Joyent. Better for business-critical apps or intense workloads than AWS, even though the learning curve is a wee bit steeper.
JakePeacock 3 Feb, 2012 12:50pm
Use Cassandra, it is built for multiple datacenter replication.