DIY Elastic Stack: The 5 stages of grief
Log management with the Elastic Stack (ELK): instant gratification?
By now, someone you know – a coworker, a boss, or a vendor at a tradeshow – has probably told you how easy it is to set up and use the Elastic Stack (aka the ELK Stack) to search and manage your logs. Just spin up an EC2 instance and install the open source software! Or just swipe your credit card and spin up a cluster from Amazon Elasticsearch Service or Elastic Cloud. Voilà! You will find all the answers you were looking for in your logs.
… or epic journey?
But truth be told, setting up a production-grade Elastic Stack and operating it without glitches doesn’t happen overnight. Think of it as a journey. As your cluster size grows from a few gigabytes to several hundred gigabytes or more and you open up access to users from outside your engineering team, you may gradually realize that it’s not exactly fun.
At Loggly, we hear from new customers every week who have grown tired of managing their own Elastic Stack environment. For these customers, what began as a spark of curiosity among a few developers ended up becoming a wildfire that required daily full-time attention from five or six people!
As I listen to customer stories about DIY Elastic Stack, I can easily draw a parallel to the Kübler-Ross model of the five stages of grief. If you are about to embark on a DIY Elastic Stack journey, or are progressing to a more advanced Elasticsearch implementation, this may help you prepare for some of the challenges ahead.
And we know some of this from our own experience. Loggly runs one of the most complex Elasticsearch implementations around, serving thousands of customers every day.
Stage 1: Denial
Your Ops team will work extremely hard to keep your Elastic Stack from going down as your production servers crank out more and more logs. This can mean waking up in the middle of the night to respond to pages about storage capacity maxing out, heap running low, or indexing delays in your pipeline. You won’t know what is going on, and you will start to wonder whether having your own Elastic Stack is worth it. Of course it is, you will assure yourself. We’re just experiencing a few early bumps. Denial will help you ignore the writing on the wall. But after a few months, all the Elastic Stack problems you are trying to deny will manifest themselves more openly.
Stage 2: Anger
Your developers and Ops team will be angry that they have this “new full-time job” troubleshooting the Elastic Stack instead of contributing to your core business. Other developers will be frustrated that they cannot find the logs they need because the Elastic Stack is down. Your customers will be dissatisfied as the time to resolve their issues gets longer and longer, resulting in customer churn and lost sales opportunities. Getting angry allows you to channel your frustration. You realize how important a reliable and scalable log management system is for your business and customers.
Stage 3: Bargaining
Your most passionate Elastic Stack supporters within the team will do anything to avoid the pain of the constant problems. You will be shown a few blogs or invited to attend some webinars or conferences on best practices for optimizing your indices, maintaining clusters, and getting high performance out of your Kafka pipeline. If only you had changed some discoverability setting, updated your architecture around pipeline hosting, or handled mapping conflicts differently, you would have unleashed the true power of the Elastic Stack. The Kübler-Ross grief model suggests your team will weave in and out of these stages. The bargaining stage will likely last several months.
Stage 4: Depression
Once you have passed the bargaining phase, your senior engineers and Ops team members figure out that sh*t just got real. This stage will feel endless, and your team will be full of doubt. That’s when the first questions will be asked about whether there is any point in continuing the status quo and owning your Elastic Stack. Some managers and executives will step in and ask what the total cost of ownership (TCO) is and whether this investment will ever show a positive ROI.
Stage 5: Acceptance
This is the stage when you accept the new reality. Some members of your team now begin to recognize that change is required. Acceptance does not mean that your team and customers are OK with all the problems you are facing with the Elastic Stack. It means you understand that owning and running a production-grade Elastic Stack is not easy and that some hard decisions need to be made. You must adjust and reorient your team’s priorities. Log management is important, but it’s not your core business. Elasticsearch is very useful and powerful, but operating and maintaining an Elastic Stack cluster is not fun. This is the stage when you finally reach out to others and find the right solution for your business.
There is hope
After getting tired of the burden of maintaining the Elastic Stack, many businesses – large and small – choose Loggly so they can focus on what really matters to them. They rely on our experience and expertise with Elasticsearch to make their logs searchable and to make log analysis fast and simple. If you are ready to shift gears, take Loggly for a test drive and see for yourself how you can escape the five stages of grief of DIY Elastic Stack.
For a deeper look at the considerations around ELK log management, read this great post by Jon Gifford.