Loggly's Outage for December 19th

Posted 19 Dec, 2011 by Kord Campbell in Business and Startup

Sometimes there's no other way to say "we're down" than to admit you screwed up and are down. We're in the process of rebuilding the indexes of historic data for our paid customers. This is our largest outage to date, and I'm not at all proud of it.

So What Happened?

Yesterday afternoon all of our machines in Amazon's East region, availability zone 1d, were rebooted by AWS staff for maintenance purposes.

The cause of our failure is what some of you on Twitter are calling "a failure to architect for the cloud".  I would refine that a bit to say "a failure to architect for a bunch of guys randomly rebooting 100% of your boxes".  We've been told by Amazon they actually had to work hard at rebooting a few of our instances, and one scrappy little box actually survived their reboot wrath.

While some might go on a rant about how 'normal' failures don't affect 100% of your boxes, the truth is that anything and everything (including an army of reboot monkeys) can be expected to happen to your servers if you wait around long enough. The trick to running a reliable service is to architect around any number of everythings that could happen to it and build for them.

In this case we didn't build the workaround simply because the system we run - a combination of 0MQ+Solr+Zookeeper+Loggly Special Sauce - makes it extremely challenging to survive a complete failure with more than half of the cluster missing. With other challenges facing us, we decided to live with the risk.

So, How Do We Make This Right?

Single instances of Loggly's search cluster can't be spread across multiple availability zones or regions due to the amount of data we push around, latencies between the search nodes, and the lack of support in our system for redundant indexes. We've been OK with those limitations in the past simply because we systematically archive data to S3 when it arrives and we are capable of rebuilding indexes on the fly if we lose one or more indexers.

Our primary method to address this will be to start sharding our customers across multiple Loggly deployments. This will keep any future outage from taking down the entire customer base. We've already been investigating other data centers, both on dedicated hardware and on other cloud-based services.
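For readers who want a picture of what sharding customers across deployments might look like, here is a minimal sketch in Python. It is not Loggly's implementation: the deployment names, the `deployment_for` and `affected_customers` helpers, and the choice of a hash-based assignment are all assumptions made for illustration.

```python
import hashlib

# Hypothetical deployment names; the post does not name any.
DEPLOYMENTS = ["deployment-a", "deployment-b", "deployment-c"]


def deployment_for(customer, deployments=DEPLOYMENTS):
    """Map a customer (e.g. an account subdomain) to exactly one deployment.

    A stable hash is used so the same customer always lands on the same
    deployment, regardless of which process does the lookup.
    """
    digest = hashlib.sha256(customer.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(deployments)
    return deployments[index]


def affected_customers(customers, failed_deployment, deployments=DEPLOYMENTS):
    """Return only the customers served by the deployment that failed."""
    return [
        c for c in customers
        if deployment_for(c, deployments) == failed_deployment
    ]
```

With an assignment like this, losing one deployment affects only the customers mapped to it rather than everyone. A plain modulo hash reshuffles most customers whenever a deployment is added or removed, so a real system would more likely keep an explicit customer-to-deployment lookup table.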

Finally, we accept full responsibility for the impact to our customers.  We will be in touch with our paid customers sometime over the next week to address compensation for this outage.

We welcome feedback below.

Kord Campbell, CEO

  • gba

    gba 19 Dec, 2011 06:21pm

    We too had to suffer through the great AWS rebootpocalypse of 2011, and it wasn’t pleasant.

    As a former sysadmin of 10+ years, I feel your pain. I always think of MacGyver when stuff like this happens. That is, no matter how many tools you’ve got in your toolbox at home, odds are you’ll be taken hostage in a 3rd world prison camp when sh*t goes down, with nothing more than a stick of gum and a bobby pin to rely on.

    Good luck!

  • Blake Irvin

    Blake Irvin 20 Dec, 2011 06:29pm

    As a former AWS user, I’ve been pretty pleased with performance and support from Joyent. Better for business-critical apps or intense workloads than AWS, even though the learning curve is a wee bit steeper.

  • JakePeacock

    JakePeacock 3 Feb, 2012 12:50pm

    Use Cassandra, it is built for multiple datacenter replication.
