To ELK or Not to ELK? That Is the Question

 

Why Elastic Stack / ELK Is a Trending Topic in Log Management

Since Elastic launched the ELK stack in 2015 and renamed it Elastic Stack earlier this year, it has generated a fair amount of interest from developers and DevOps teams. Many Loggly customers have evaluated the Elastic Stack on the road to adopting our cloud-based log management service. And because we run one of the biggest and most complex Elasticsearch deployments ourselves, our engineering team has been paying close attention to the evolution of Elastic Stack.

Let’s be honest: Elastic has created a really nice open-source project. In fact, we use Elasticsearch, the search engine part of the Elastic Stack, as a core part of Loggly’s foundation technologies. Elastic Stack does a lot of very useful things, and it is evolving quickly. As an open-source product, it has that “free” thing going for it. Of course, “free” doesn’t necessarily mean no cost. Quoting from the Free Software Foundation:

“Free software” is a matter of liberty, not price. To understand the concept, you should think of “free” as in “free speech,” not as in “free beer.” We sometimes call it “libre software” to show we do not mean it is gratis.

In this post, we’ll look at some factors you should consider when deciding whether to adopt Elastic Stack or a commercial log management service like Loggly.

Homebrew vs Off-the-Shelf?

Let’s try and frame this in a way that (hopefully) we can all relate to: beer! It’s pretty easy to brew your own beer. Spend some money up front on hardware, do some research on how to use it, buy yourself some malted barley, hops, and yeast, and you’re off to the races.

But did you really come out ahead?

Think about your answer as we look at similar issues in log management.

How many internal resources do you want to dedicate to log management?

Regardless of whether you run your Elastic Stack on-premises or through a cloud service offered by Elastic, Amazon, or others, using it for log management means dedicating some amount of engineering resources to running it. These include:

  • People who aren’t working on making your own product better
  • Machines that aren’t delivering that product but still need to be managed (and paid for)

And keep in mind, in most organizations, running an internal log management stack is not considered a “gold star” expertise.  You probably won’t get much credit when it’s working but there will be plenty of heat when it’s not.

Elastic Stack customers cite two areas that require specific expertise:

  • Data processing: With Elastic Stack, any pre-processing of your log data requires you to write grok filters. This also means maintaining grok filters as your log data changes over time. On the other hand, Loggly handles parsing automatically for many log types. For any log types we don’t parse, you can easily create your own derived fields without learning a special language. With Loggly, you can define your derived fields through a point-and-click interface or with regexes if you prefer.
  • User management: Most people say this is a pain with Elastic Stack. The hosted Elastic Stack services alleviate many of the problems that you face when running on-premises, but you’ll still have to allocate some internal expertise.
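To make the data-processing point concrete, here is a minimal sketch of what a regex-based derived field looks like, and how easily it goes stale when a log format drifts. The log line format, the field names, and the parse_line helper are all invented for illustration; this is neither grok syntax nor Loggly's actual parser.

```python
import re

# Hypothetical log line format, invented for illustration only.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): "
    r"request took (?P<duration_ms>\d+)ms"
)


def parse_line(line):
    """Return derived fields as a dict, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric field so it can be aggregated later.
    fields["duration_ms"] = int(fields["duration_ms"])
    return fields


sample = "2016-11-02T10:15:30Z WARN checkout-api: request took 842ms"
print(parse_line(sample))

# A small format change ("842 ms" instead of "842ms") silently stops matching.
changed = "2016-11-02T10:15:30Z WARN checkout-api: request took 842 ms"
print(parse_line(changed))
```

The second call returns None, which is the maintenance cost in miniature: someone has to notice that the pattern no longer matches and update it.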

It is fairly easy to deploy and run the Elastic Stack components. But you still have to figure out what that deployment looks like: how many machines you need, what configuration they should be, and how much redundancy you want. Similar questions need to be answered with respect to your Elasticsearch topology (e.g., How many nodes? Dataless masters? Hot/warm nodes? How much heap?) and indices (e.g., How many? How many shards? How many Types?).
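Even a back-of-the-envelope answer to the "how many shards?" question takes a few inputs. The sketch below is plain arithmetic under assumed numbers (50 GB of logs per day, 14 days of retention, a 30 GB target shard size, one replica). None of these figures are recommendations from us or from Elastic, and real sizing also depends on heap, query patterns, and hardware.

```python
import math


def estimate_primary_shards(daily_gb, retention_days, target_shard_gb):
    """Estimate how many primary shards are needed to hold retained logs."""
    total_gb = daily_gb * retention_days
    return max(1, math.ceil(total_gb / target_shard_gb))


def estimate_disk_gb(daily_gb, retention_days, replicas):
    """Total disk for primaries plus replicas, ignoring index overhead."""
    return daily_gb * retention_days * (1 + replicas)


# Assumed inputs: 50 GB/day, 14 days retention, 30 GB per shard, 1 replica.
shards = estimate_primary_shards(50, 14, 30)
disk = estimate_disk_gb(50, 14, 1)
print(shards, disk)
```

Change any one input (say, retention doubles) and the answer changes with it, which is why these decisions have to be revisited as usage grows.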

No big deal, you say? OK, but these things change over time. As you push more logs into Elastic Stack or use it more heavily, you have to manage indices and grow the cluster. Things can, and will, break. New versions will be released, and upgrading your cluster can take planning and time.

You end up being an expert on Elastic, when you could have been an expert on your own product.

If your product is based on the Elastic Stack, that is absolutely a good thing. But if the only thing you use it for is logging, does it really make sense? Those “Expertise Tokens” could have been spent on your own product: How much better, faster, more stable, more scalable could it be if it had had the same amount of time and energy spent on it?

How well do you understand your data needs?

You might also like:
No B*LLSH*T Benchmarking

Learn a number of useful benchmarking techniques in this whitepaper by going hands-on with Elasticsearch.

The Pragmatic Logging Handbook

The real question is not how much to log, but how to log.

Elasticsearch can grow to be a complex beast. If you misconfigure it or abuse it, it will misbehave, sometimes quite badly. It’s easy to get all fired up about the power that Elastic Stack gives you, but it’s just as easy to do the wrong thing and end up in some pretty deep, dark places. We’ve learned a lot of lessons the hard way here at Loggly, and our system is more robust because we know as much about what not to do with Elasticsearch as we do about what it’s good at.

For example, Elasticsearch queries on very large indices can cause out-of-memory (OOM) problems that require a skilled resource (or lots of hardware) to fix. (I analyzed this at some length in an eBook I wrote on benchmarking.) Tapiki has also shared its experience with this same issue (and others).

On top of that, if you really want to get the most from your logs, you’re going to have to dive quite deep into how Elastic takes your logs and turns them into useful data. Elastic is schemaless in the sense that you don’t have to define what your data looks like ahead of time. That’s great, but you do need to know how you want to use that data, then send it in a format that lets you get the most out of it. I’m not going to pretend that this is rocket science, but there are some gotchas that can result in time-consuming trial and error when you’re just starting out.

If you only have a few developers looking at application logs, things might be pretty simple. But what happens when your application goes into production? Or when you want to do analytics to see how it performs? Or when you realize you should also be sending database logs? Or system logs? Or… The nice thing about really flexible, powerful tools is you can always find new things to throw at them. I’ll be honest: Once you see what you can do with a system like this, it becomes kind of addictive. The downside is that you can get lost in transforming that firehose of data into something you can filter, slice, or dice in many different ways.

Interestingly, we get a lot of customers who have been running the Elastic Stack. What they say tends to be some variant on “We’ve realized that becoming great at running our own logging system is not going to help us gain one additional subscriber to our game platform.”

How well do you understand your log volumes?

As we have mentioned before, our experience with more than 75,000 users has shown us how unpredictable log data can be. When one of our customers has a problem, we can see its log volumes go up by a factor of ten or more. Who hasn’t accidentally enabled debug-level logging in production? This isn’t much of a problem when you’re managing as many logs as we are, but it could be a huge problem if you’re managing a single log management installation. This is the main reason why any log management system needs to be built with a lot of redundancy and a robust queueing mechanism.
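Here is a rough sketch of why a tenfold spike calls for queueing and headroom. It simulates a queue in front of an indexer that drains at a fixed rate. The traffic figures (1,000 events per minute normally, a five-minute spike at ten times that, an indexer that handles 2,000 per minute) are assumptions picked for illustration, not measurements from our system.

```python
def simulate_backlog(arrivals_per_min, capacity_per_min):
    """Return the queue depth after each minute for a fixed-rate consumer."""
    backlog = 0
    depths = []
    for arrived in arrivals_per_min:
        # Whatever the consumer cannot process this minute stays queued.
        backlog = max(0, backlog + arrived - capacity_per_min)
        depths.append(backlog)
    return depths


# Assumed traffic: 1,000 events/min normally, a 10x spike for five minutes.
normal, spike = 1_000, 10_000
arrivals = [normal] * 5 + [spike] * 5 + [normal] * 60
depths = simulate_backlog(arrivals, capacity_per_min=2_000)

peak = max(depths)
peak_minute = depths.index(peak)
# Minutes from the peak until the queue is empty again.
minutes_to_drain = depths.index(0, peak_minute) - peak_minute
print(peak, minutes_to_drain)
```

With these assumed numbers, a five-minute spike leaves a backlog that takes far longer than five minutes to clear, even though the indexer has twice the capacity needed for normal traffic. Without a queue big enough to hold that backlog, those events are simply lost.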

Of course, redundancy = iron. The need to have extra hardware on hand is one of the reasons that TCO for Elastic Stack installations can be quite unpredictable. Once people start depending on that installation, failures become exponentially more painful. The first few outages can be waved off as growing pains, but when the outage means a report or dashboard fails at just the wrong time (say, at a C-level strategery meeting), and the only solution is a truckload of servers, things start getting interesting.

Using a hosted ELK service can alleviate some of the unpredictability, but your cost of ownership will also scale up as your log volume grows. We’ve spent the last six years optimizing our search infrastructure, and we’re doing things that don’t make sense to do for a single customer system. If you’re dealing with terabytes of logs per day, the time and effort to squeeze the maximum out of your Elastic Stack might make sense. If not, you should stop and think about how you spend your time (and money).

How quickly do you need to know about problems?

Once you have your logs in Elastic Stack, you’re going to go exploring to see what’s happening across all of those boxes, apps, and services. And you’re going to find some scary stuff. But you can’t just be sitting in front of your screen hitting the search button every 10 minutes (although that can be lots of fun). Luckily, you can get the system to do the work for you: run those searches automatically and have it yell at you when things don’t look quite right.

Alerting. It isn’t just a good idea…

Elastic provides Watcher if you have a subscription, and there are other open-source alerting packages available that use Elasticsearch. What this means is that in order to do alerting, you either have to pay Elastic for a subscription (increasing your TCO) or invest even more time in learning how to use another open-source project. Again, I won’t pretend this is brain surgery, but it is more time, more Expertise Tokens, and another potential upgrade hiccup.
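Stripped to its core, an alert is a saved search plus a threshold plus somewhere to send the message. The sketch below shows only that core. The check_threshold function and both callables passed to it are hypothetical stand-ins; it does not use the Watcher API or any real Elasticsearch client, and a real setup would also need scheduling, retries, and deduplication.

```python
def check_threshold(count_matches, threshold, notify):
    """Run one saved search and notify if the match count exceeds threshold.

    count_matches and notify are injected callables: hypothetical stand-ins
    for a real search client and a real notification endpoint.
    """
    count = count_matches()
    if count > threshold:
        notify(f"{count} matching events (threshold {threshold})")
        return True
    return False


# Stand-ins for illustration: a fixed count and an in-memory message list.
sent = []
fired = check_threshold(lambda: 137, threshold=100, notify=sent.append)
print(fired, sent)
```

Everything around this function (running it on a schedule, keeping it working through upgrades, wiring notify to a pager) is where the extra time goes.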

If you’re in development, finding out about problems is probably not a time-critical issue. If you’re running an app in production and supporting many users and/or SLAs, timely alerting is a must. That’s the big reason that Loggly not only offers email alerting but also integrates with the endpoints that you’re already using to operate your application, such as PagerDuty, HipChat, Slack, and VictorOps.

Do you need commercial-grade support and defined SLAs?

When using open-source software (even if it’s hosted), you might be on your own when outages or bugs hit. Developer communities can be extremely responsive, efficient, and fast when it comes to fixing bugs, but they can also be quite the opposite. The bug that brings your instance down might be a rare corner case that nobody cares about, at least not in the short term. Or you might be running an older version at a time when the community has moved on to the next. Maybe there is a bug fix in the newer version, but nobody is interested in backporting it, and you might not have the resources or expertise to do so. These are all things to consider.

You can, of course, pay Elastic to provide support for you (and they really do a great job), but your “free” software suddenly becomes a little more expensive than you thought it was going to be.

How does log management fit into your core business?

In general, the question of open source versus commercial off-the-shelf (COTS) is not so much a question of which is better or more powerful. In most cases, it comes down to a build versus buy discussion, with the associated questions of TCO, support, and maintenance.

In fact, many commercial logging solutions use open source under the hood, and Loggly is no exception. When you use Loggly, you are indeed taking advantage of Elasticsearch, Apache Lucene, Apache Kafka, and many other open-source software components. The bottom-line question is: Can you invest the money and effort in building and maintaining your own solution? And will that investment provide the best return for your business (and investors)?

If you’re early in your development lifecycle and have more time than capital, the Elastic Stack may be a good choice for you. (At the very least, it should be an interesting journey.) If your application itself uses Elasticsearch, then using it for logging might be less of a stretch in expertise, although many of the cost factors may still apply.

If you’re beyond that early stage though, and not using Elastic in your product, you will find that as an individual user, you probably won’t be able to realize the economies of scale that Loggly offers. You’ll also have to handle maintenance, upgrades, outages, and requests for new data or new ways of parsing old data. For Loggly, our maintenance efforts are financed by selling the solution to thousands of customers, and for those customers all of these things happen as if by magic. And of course, we live and breathe log management. Your developers may have other things they want to do.

To Stack or Not to Stack? It’s Up to You

Let’s go back to our home-brew beer and the question we started with: Did you really come out ahead?

If brewing beer is your hobby, you like experimenting with different techniques, and you have some spare cash to buy extra carboys, a fancy hydrometer, and your own grain mill, then yes, you probably did. But if your beer gets infected, or you have a boilover and have to spend hours cleaning black scum off your stove, then maybe not. And there is a lot of boring stuff you have to do to make really good beer: cleaning and sterilization, detailed notes so you can actually repeat that great batch, and waiting … waiting … waiting … for the fermentation to finish.

You might be saving money (if you brew a lot), but you will be spending a lot of time.

In your personal life, that choice is yours to make. But when you are at work, things get more complicated.

Time is far more precious than money for most companies. Sometimes, you just need a beer right now! And the pressure to make sure every batch of beer you brew is at least as good as the last builds up. Maybe you should just buy that six-pack of Firestone Wookey Jack?

Scale is also a real problem for many companies. If you’re growing at 10% per month, you’re going to more than triple in size every year. At some point, you’ve gone from home-brewing to being a microbrewery, and your hobby has become a full-time job (and not the one you started with).
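The yearly figure comes from compounding: 10% growth applied twelve times in a row. A quick check, where growth_factor is just a helper name made up for this example:

```python
def growth_factor(monthly_rate, months):
    """Compound growth multiplier after the given number of months."""
    return (1 + monthly_rate) ** months


yearly = growth_factor(0.10, 12)
print(round(yearly, 2))  # 3.14
```

The same arithmetic applies to your log volume, and to the cluster that has to hold it.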

I’m not going to argue that home brewing is always a bad choice. Given how much code we have at Loggly that is ours and ours alone, that would be silly. And honestly, mastering the Elastic Stack can make you feel like a superhero (some of us do think that!). But one of the many good reasons why the “As A Service” (i.e., off-the-shelf) approach has taken off is that you don’t need to be an expert in Technology X to be able to get huge value from it. In some cases, you don’t need to know *anything* about it. Should you become an Elastic Stack expert, more power to you, but in getting there, what other things did you not do?

If you’re looking at Elastic Stack now, I personally invite you to test drive Loggly. You’ll have 30 days to experience our full feature set and answer the questions I posed in this blog post for yourself.

