As your company grows, you will inevitably end up sending more and more log data. Hundreds of gigabytes and even terabytes every day! As someone who needs to get answers from that data, you don’t want big log volumes to create barriers to insight. Loggly Gamut Search™ introduces a new approach for high-performance search that we developed specifically for time-based datasets like log files. It enables our customers to search more data and search that data faster than they could before.
In this video, we (Jon and Derek) talk about why high-performance search matters to log management users, how we designed our search and caching approach, and what these changes mean for our product.
Gamut Search will be rolled out to Loggly customers over the coming weeks. It will be available for all Loggly pricing tiers, including Lite accounts.
If you want to dig even deeper into our solution, read on.
Big Data Searches Strike Fear into the Hearts of Log Management Users
How much do you hate watching a spinner as you wait for search results? Has spinner fear ever prevented you from digging deeper into your data? Do you hesitate to expand and look at your full history, or do you limit yourself to small subsets of your data because you don’t want to wait?
When you execute a search on a multi-gigabyte log dataset, a huge amount of work goes on inside a search engine like Elasticsearch, and all of it has to complete before you see anything. But with log data, we know that many searches are really just probes: attempts to validate your ideas about what might be going on. We can see this in our own logs, as people refine their queries to drill down on the log events that contain the information that matters.
When we looked at all of the queries coming into Loggly, we found that:
- 51% of searches are repeated with at least partial overlap within 30 minutes
- 43% of searches overlap at least half the time range
- 36% of searches overlap at least 90% of the time range
That data told us that the right caching strategy could have a big impact on performance. To take advantage of that, we also had to look at how queries are executed and whether we could improve things there.
Caching Drives Search Performance on Big Datasets
Anyone looking to build a high-performance search engine knows the importance of caching. No matter what underlying technology you’re using, the fastest search is the one that pulls results from cache rather than querying the index.
The Loggly service is based on Elasticsearch (ES), which is designed as a general-purpose search engine. We have been working with ES for a long time and were one of the first companies to bring a SaaS product built on it to general availability (on version 0.90.13). We are also one of the largest ES installations around, with hundreds of terabytes of index, gigabytes of cluster state, and tens of millions of fields in our indices, all servicing more than 10,000 customers.
Elastic has been working on query caching (aka filters) for many years now and has made some significant improvements. But we are a bit of an edge case for them. Our data is hugely varied because we have thousands of customers. Our query stream is equally varied. These two factors mean that the Elasticsearch cache is simply not large enough to be able to keep results live for long enough. By implementing an external cache, we can size it to keep results available for as long as we want, without having to worry about competing for resources within the Elasticsearch JVM.
ES is an incredible open-source, general-purpose search engine. And Loggly wouldn’t be where it is today without ES. But as with any general-purpose technologies, you encounter trade-offs when you apply it to your specific application. For us, those trade-offs came into play with very large data volumes, from a combination of individual large customers plus thousands of smaller customers. We knew we could do a lot better by being smarter about how we search and how we present the results.
How Does Elasticsearch Process a Query?
In ES, every search is distributed across all of the nodes in the cluster that contain data relevant to that search. This approach means that the cluster is working at its maximum capacity to service the request, and it works really well when you have a small amount of data (everything is relative — by small, we mean hundreds of GB).
But when you have a query that needs a huge amount of data, physics is not your friend. Every time you query, you have to wait for all of ES’s work to finish before you see any results. The more logs you generate, the slower your searches are going to be. It’s a constant uphill battle.
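To make that cost concrete, here is a minimal Python sketch of the scatter/gather pattern. The per-shard timings are made up for illustration and are not real Elasticsearch internals; the point is that the caller waits on the slowest shard.

```python
def shard_search_ms(data_gb):
    """Simulated per-shard search time; assume ~50 ms per GB scanned.

    Purely illustrative numbers, not real Elasticsearch behavior.
    """
    return data_gb * 50

def cluster_search_ms(shard_sizes_gb):
    """Scatter/gather: the query fans out to every shard holding relevant
    data, and nothing is returned until the slowest shard finishes."""
    return max(shard_search_ms(gb) for gb in shard_sizes_gb)

# Three shards; one holds far more data than the others.
# The whole search is gated by the 10 GB shard.
wait_ms = cluster_search_ms([1, 2, 10])  # 500 ms, dominated by the big shard
```

The more data any one shard has to scan, the longer every search on that index takes, no matter how fast the other shards finish.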
Doesn’t Elasticsearch Already Cache?
ES does have its own caching mechanism, which works really well if you have a single-tenant system with a small number of indices. But Loggly is a multi-tenant system, which means that we have to deal with more indices, more variety within those indices, and a more varied query stream. For us, this means that the ES cache fills up quickly and then starts evicting results — even for searches that our users performed 30 minutes earlier. Since the ES cache lives inside the ES JVM, there is also pressure from all of the other activities that are going on within the ES node (such as indexing).
Here’s an analogy. Your browser caches a ton of stuff, and does a really good job when it’s just you using it. But what if 1,000 people used that same browser? Or 10,000? Eventually, the cache is going to fill up. Some of your stuff will inevitably be evicted, and your browser will have to fetch all that stuff again the next time you use it.
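The analogy can be sketched with a tiny LRU cache in Python. `LRUCache` and its capacity are invented for illustration; they just stand in for the shared browser cache in the analogy.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny least-recently-used cache, standing in for the browser cache."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

# One user: their entries stay comfortably in cache.
cache = LRUCache(capacity=100)
for page in range(10):
    cache.put(("you", page), "content")
assert cache.get(("you", 0)) is not None  # still cached

# 10,000 users sharing the same cache: your entries get pushed out.
for user in range(10_000):
    cache.put((user, 0), "content")
assert cache.get(("you", 0)) is None  # evicted; must be fetched again
```

This is exactly the pressure a multi-tenant query stream puts on a fixed-size cache inside the ES JVM.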
Log Management Search Should Take Advantage of Time
As we hinted at above, time is a first-class citizen when it comes to log data. When optimizing performance for log management use cases, we know we can take advantage of the fact that all of our data is organized by time and that old data doesn’t change.
When a user executes a query through Gamut Search, the Loggly backend slices it up into smaller, faster queries based on time. Each slice is handled independently of the others. Search systems all perform better when they’re dealing with smaller volumes of data, and we can reduce the amount of data for any individual search by reducing the time span we’re looking at.
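Here is a simplified Python sketch of that slicing step (`slice_time_range` is a hypothetical helper, not Loggly’s actual code): a single seven-day query becomes seven independent one-day sub-queries.

```python
from datetime import datetime, timedelta

def slice_time_range(start, end, slice_width):
    """Split [start, end) into contiguous slices no wider than slice_width.

    Each slice becomes an independent, smaller search against the backend.
    """
    slices = []
    cursor = start
    while cursor < end:
        slice_end = min(cursor + slice_width, end)
        slices.append((cursor, slice_end))
        cursor = slice_end
    return slices

# A 7-day query sliced into 1-day sub-queries:
start = datetime(2017, 1, 1)
end = datetime(2017, 1, 8)
day_slices = slice_time_range(start, end, timedelta(days=1))
# Seven independent sub-queries, one per day
```

Because each sub-query touches far less data, each one completes far faster than the original monolithic search would.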
The first impact of this change is that we can get the first slice of results to our users far quicker than we could previously. Our UI is now far more responsive and fluid than it was before we made these changes.
In addition, we cache those results so we can avoid having to re-execute the search if we see it again. Since this cache lives outside of ES, we can have a very large cache with a much longer lifetime for its contents. If we have search results from yesterday, we can cache those results for as long as we want. Or, for example, if a user does a specific search on the last seven days of data every day, Loggly can deliver most of the results from cache.
The cache has been designed so that we can reuse results for queries that have very different time spans. We can also take advantage of the fact that we only have a short time window (seconds to minutes) of real flux in the data, so we can cache even very recent results.
By implementing an external cache, we can free up heap within the ES JVM and ensure that our cache can provide the longevity we need for our results. It also means, by the way, that we don’t have to worry so much about how many indices we’re searching, or what is in them.
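Putting those ideas together, the per-slice caching logic might look like this sketch. `FLUX_WINDOW`, `run_query`, and the dict-backed cache are all illustrative stand-ins, not Loggly’s actual implementation.

```python
from datetime import datetime, timedelta

# Assumed: only data from the last couple of minutes is still in flux.
FLUX_WINDOW = timedelta(minutes=2)

# Stand-in for the external cache (in reality a dedicated store, not a dict).
cache = {}

def run_query(query, slice_start, slice_end, now, execute):
    """Return results for one time slice, preferring the external cache.

    `execute` is the expensive backend search; all names here are
    illustrative.
    """
    key = (query, slice_start, slice_end)
    if key in cache:
        return cache[key]  # cache hit: no search engine work at all
    results = execute(query, slice_start, slice_end)
    if slice_end < now - FLUX_WINDOW:
        # The slice is entirely in the past and settled, so its results
        # will never change; keep them for as long as we like.
        cache[key] = results
    return results
```

Because settled slices never change, a repeat of yesterday’s search can be answered entirely from cache; only the slice covering the last couple of minutes needs to be re-executed.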
Our search handler and cache are instrumented so that we can see exactly how they are performing, and we will be working on continuous improvements to the system based on that data. We already have some ideas that we’re experimenting with, and we know there are ways to improve performance beyond what we have today.
Backend Caching Has a Big Impact on User Experience
Instead of waiting for a single response from the backend, Gamut Search now streams data into the browser, providing immediate user feedback. If even a portion of the results is in cache, we can send it back immediately while the rest of the search executes. As more data comes in, Gamut Search progressively updates the view.
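In Python terms, that streaming behavior could be sketched as a generator (hypothetical names; the real system streams results over the network to the browser):

```python
def stream_results(query, slices, cache, execute):
    """Yield (slice, results) pairs as each becomes available.

    Cached slices are yielded first so the UI can render immediately;
    the remaining slices stream in as the backend finishes each one.
    Names are illustrative, not the actual Loggly implementation.
    """
    pending = []
    for s in slices:
        if (query, s) in cache:
            yield s, cache[(query, s)]  # immediate: already cached
        else:
            pending.append(s)
    for s in pending:
        results = execute(query, s)  # expensive backend search
        cache[(query, s)] = results
        yield s, results  # progressive update as each slice completes
```

The browser can paint each yielded slice as it arrives instead of waiting for the full result set.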
Instead of staring at a blank screen with a spinner on it, you can start analyzing data and planning your next action. We know that for most searches, we can instantly show you enough events from the last few minutes to let you decide whether you searched for the right thing. If you got it right, you’re done. If you got it wrong, you can immediately revise your query and try again. Because Gamut Search is so responsive, you can iterate really quickly and get to the right results in a lot less time. If you had to wait even 20 seconds for each search to complete, it would be a pretty frustrating experience. We want to remove that frustration.
Gamut Search also includes new controls on the timeline, which make narrowing down or expanding your search much faster and easier. We’ve created a new event view that has JSON support, so you can see the structure of your events clearly, and we’ve improved the highlighting, so you can see everything that matched your search request. You get clearer information, quicker, and you have more control over how to act on it.
Why Not Solve Search Performance with More Hardware?
Of course, adding hardware will improve search performance, but our experience tells us that it’s a slippery slope. Taken to an extreme, you could build a system where all of the index data lived in RAM and all searches completed in 10 milliseconds. The problem is that no one would be able to afford that level of performance!
Since Loggly was founded, we have been focused on delivering a product that is cost-effective for companies of any size. To get the same performance gains we achieved with Gamut Search using hardware alone, we would have to add a massive amount of hardware. That would not give us the best value proposition for our customers, either in cost structure or user experience.
If you take the time to think a bit harder about what happens in search, you realize that brute force is not always the best way to solve the problem.
Moving Forward with Search Performance
We’re glad to have Gamut Search out in the market, but we’re already charting our path forward. We have performance data that tells us how effective each part of the system is, and that will guide us as we iterate on Gamut Search. We expect to improve our caching performance, the way we execute queries, and how we use the data in the browser. Our goal is not just fast performance; it’s to make Loggly as fluid and interactive as possible.
Summary: High-Performance Search Has Dual Benefits
Gamut Search doesn’t sidestep physics, but it does a much better job of getting users what they want, faster. It gets actionable information into their hands as quickly as possible and makes it easier to dig deeper and look over their entire log history. And that helps them make better-quality decisions.
Our underlying approach provides us with:
- Cost effectiveness over the long run as our business continues to grow. A log management system that doesn’t take advantage of the peculiarities of log data will experience escalating costs as log volumes grow and still won’t be able to reach the level of performance that Gamut Search can.
- The power to implement innovative approaches in our user experience that couldn’t be accomplished in a traditional system.
Our new approach to search allows us to deliver functionality and user experiences that just weren’t technically feasible before. It has forced us to think differently. And we’re looking forward to the day when the loading spinner is just a distant memory.