Loggly's Outage for December 19th
Sometimes there's no other way to say "we're down" than to admit you screwed up and are down. We're in the process of rebuilding the indexes of historic data for our paid customers. This is our largest outage to date, and I'm not at all proud of it.
So What Happened?
Yesterday afternoon all of our machines on Amazon's East region, availability zone 1d, were rebooted by AWS staff for maintenance purposes.
The cause of our failure is what some of you on Twitter are calling "a failure to architect for the cloud". I would refine that a bit to say "a failure to architect for a bunch of guys randomly rebooting 100% of your boxes". We've been told by Amazon they actually had to work hard at rebooting a few of our instances, and one scrappy little box actually survived their reboot wrath.
While some might go on a rant about how 'normal' failures don't affect 100% of your boxes, the truth is that anything and everything (including an army of reboot monkeys) can be expected to happen to your servers if you wait around long enough. The trick to running a reliable service is to anticipate every one of those everythings and architect around it.
In this case we hadn't built that protection, simply because the system we run - a combination of 0MQ+Solr+Zookeeper+Loggly Special Sauce - makes it extremely challenging to survive an outage that takes out more than half of the cluster at once. With other challenges facing us, we decided to live with the risk.
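To make that constraint concrete, here's a minimal sketch of the majority-quorum rule that ZooKeeper-style coordination imposes (an illustration, not Loggly code): an ensemble stays available only while a strict majority of its nodes are up, so a maintenance sweep that reboots most of the cluster at once leaves nothing standing to coordinate a recovery.

```python
# Illustration only (not Loggly's code): ZooKeeper-style majority quorum.
# An ensemble of N coordination nodes keeps serving only while
# floor(N/2) + 1 of them are up; lose more than half and it halts.

def quorum_survives(ensemble_size: int, nodes_down: int) -> bool:
    """Return True if a strict majority of the ensemble is still running."""
    nodes_up = ensemble_size - nodes_down
    majority = ensemble_size // 2 + 1
    return nodes_up >= majority

if __name__ == "__main__":
    for down in range(6):
        status = "quorum holds" if quorum_survives(5, down) else "quorum lost"
        print(f"5-node ensemble with {down} node(s) down: {status}")
```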
So, How Do We Make This Right?
A single instance of Loggly's search cluster can't be spread across multiple availability zones or regions due to the amount of data we push around, the latencies between search nodes, and the lack of support in our system for redundant indexes. We've been OK with those limitations in the past simply because we systematically archive data to S3 as it arrives and can rebuild indexes on the fly if we lose one or more indexers.
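For the curious, the archive-and-replay idea looks roughly like the sketch below. The bucket name, key layout, and Solr endpoint are made up for illustration and our real pipeline is more involved, but the principle is the same: every event is written to S3 on arrival, so an index can always be rebuilt by replaying the archive.

```python
# A minimal sketch of archive-then-rebuild (assumed bucket name, key layout,
# and Solr endpoint -- this is not Loggly's actual pipeline).
import json
import boto3
import requests

s3 = boto3.client("s3")
ARCHIVE_BUCKET = "example-log-archive"                                   # assumed name
SOLR_UPDATE_URL = "http://localhost:8983/solr/logs/update?commit=true"  # assumed core

def archive_batch(customer_id: str, batch_id: str, events: list) -> None:
    """Write a batch of raw events to S3 as soon as it arrives."""
    key = f"{customer_id}/{batch_id}.json"
    s3.put_object(Bucket=ARCHIVE_BUCKET, Key=key, Body=json.dumps(events))

def rebuild_index(customer_id: str) -> None:
    """Replay every archived batch for a customer back into the search index."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=ARCHIVE_BUCKET, Prefix=f"{customer_id}/"):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=ARCHIVE_BUCKET, Key=obj["Key"])["Body"].read()
            docs = json.loads(body)
            # Re-post the raw documents to Solr's update handler.
            requests.post(SOLR_UPDATE_URL, json=docs).raise_for_status()
```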
Our primary fix will be to start sharding customers across multiple independent Loggly deployments, so that no single failure can take out the entire customer base. We've already been investigating other data centers, both on dedicated hardware and on other cloud-based services.
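Conceptually, customer-level sharding just means each account is pinned to exactly one deployment, so a failure in one deployment only touches the customers living on it. The sketch below uses a simple hash to pick a deployment; the deployment names are hypothetical, and in practice the mapping would more likely live in a lookup table we control.

```python
# Hypothetical illustration of customer-level sharding (deployment names are
# made up): each customer deterministically maps to exactly one deployment.
import hashlib

DEPLOYMENTS = ["us-east-cluster-1", "us-east-cluster-2", "dedicated-dc-1"]  # assumed names

def deployment_for(customer_id: str) -> str:
    """Deterministically pin a customer to a single Loggly deployment."""
    digest = hashlib.sha1(customer_id.encode("utf-8")).hexdigest()
    return DEPLOYMENTS[int(digest, 16) % len(DEPLOYMENTS)]

print(deployment_for("acme-corp"))  # always routes this customer to the same deployment
```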
Finally, we accept full responsibility for the impact to our customers. We will be in touch with our paid customers sometime over the next week to address compensation for this outage.
We welcome feedback below.
Kord Campbell, CEO