Everything You Know About Search and Logging Is Wrong
Being Chief Search Officer for the world’s most popular cloud-based log management service has its advantages. People want to know what Loggly is doing and how we did it. Our partner New Relic asked if I would share my experience with an audience of 800+ at #FutureStack13, and I said “let’s do this.”
It’s a great opportunity to reflect and speak on how Loggly came together by looking back over a decade and a half of my learning in logging and search, and clarify where I believe it is headed next. I’ve been incredibly lucky to have worked with a number of phenomenally talented people over this time, so although this is written as my story, and my epiphanies, it is really a testament to those people from whom I’ve learned so much.
If you’re one of the lucky people at #FutureStack13, I’ll look forward to seeing you at my session tomorrow (Friday) at 2:30 p.m. If not, here is a summary of what I’ll be saying, expanding on the contents of the slides, which should be up sometime soon.
1997-2003 – The early years
I stumbled into search at LookSmart, and almost immediately thought “Whoa! This looks like fun!” Having spent far too long in previous jobs dealing with the constraints that come along with using traditional databases, the idea of being able to use data without having to worry about how to normalize it, or how to deal with a schema change, was an epiphany that I’ve carried with me since that time. Because there weren’t a whole lot of options at the time, we had multiple custom engines, which were incredibly challenging to work on, across many dimensions. Some of these systems were quite small, and some were huge (for the time) – running 900 machines really makes you think hard about scalability, reliability, and performance. This was the time that I really started to appreciate just how hard it is to run a distributed system without a really solid set of tools to let you troubleshoot, monitor, and analyze its behavior.
2004 – Hints of true big data
In 2004, Doug Cutting showed up at LookSmart to talk about Lucene and Nutch. The beauty of Lucene was that it performed incredibly well, and was very nicely architected and implemented. I was lucky enough to be able to use it for a couple of projects at LookSmart, and was impressed enough that it became hard for me to convince myself that I’d be better off rolling my own engine. The projects I used it for were fairly small scale, but complex enough that we were pushing its boundaries and forcing me to dive into the internals. That diving stood me in good stead for what was to come later.
While working on Lucene at LookSmart, it occurred to me that maybe it would be a better tool for some of the troubleshooting, monitoring, and analysis work that I needed to do. Unfortunately, it wasn’t quite there yet: committing data frequently was too expensive, and numerics support was pretty much non-existent. But the seed had taken root…
2005-2006 – Distributed everything
At Technorati, I worked on the blog search engine, and we ended up building a pretty sophisticated layer on top of Lucene to support a fully distributed, time-sliced index. Lucene provided very little support for this, so we spent a lot of time on plumbing. That sounds pretty boring, but it is actually incredibly challenging to get right. We were running well ahead of the main Lucene development – many of the things we implemented didn’t show up in Lucene, or later in Solr and ElasticSearch, for years. Fun stuff!
One of the biggest technical wins at Technorati was a system-wide Spread bus, from which we consumed all of the data that ended up in the indices. This allowed us to index in pretty close to real time, but we weren’t able to serve those indexed documents immediately because Lucene commits were too expensive. The idea that we were no longer dealing with batch updates from some external database, however, was just as big an epiphany as my earlier “Hallelujah! No more Schemas!” one. We used this bus for monitoring, which was hugely valuable. Analytics and troubleshooting were still a little painful, but with guidance from the monitoring, there was a little less hair-tearing.
Lucene still wasn’t quite ready for use as a troubleshooting, monitoring, and analysis (TMA) tool – even though we knew how to update relatively quickly, numerics were still a problem.
2009 – Real-time distributed search
In late 2009, after working on yet another streaming, fast update, distributed search system at Scout, it seemed to me that Search was pretty damn close to being the answer for my TMA toolbox. Having been forced to build that toolbox for every search system I’d worked on, it seemed kind of obvious that combining my biggest frustration with my biggest love was the way to go.
At the time we started Loggly, both SolrCloud and NRT were just over the horizon. We grabbed a copy of Solr, and started work on the first-generation Loggly product. We got a lot of the plumbing for free, but because we wanted to do time-based sharding, we still had a ton of work to do. We added plugins that did what we needed: 0MQ for real-time streaming, a new query handler for improved search, and custom index management for creation, allocation, and merging of indices. And then we fed the logs from all of the various components of our system into another deployment – we call it LogFooding, and it has proved invaluable both for TMA and to help us understand how to improve our own product.
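To make the idea of time-based sharding a little more concrete, here is a minimal sketch in Python. It is not Loggly’s implementation: the one-hour slice width, the `events-` naming scheme, and the function names are all assumptions made up for illustration. The point is simply that each event is routed to an index chosen by its timestamp, and a query over a time range only has to touch the indices whose slices overlap that range.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical slice width; a real system would tune this to its volume.
SLICE = timedelta(hours=1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def slice_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of the slice that contains it."""
    whole_slices = (ts - EPOCH) // SLICE  # timedelta // timedelta -> int
    return EPOCH + whole_slices * SLICE


def index_for(ts: datetime) -> str:
    """Name of the (hypothetical) index an event with this timestamp goes to."""
    return "events-" + slice_start(ts).strftime("%Y%m%d%H")


def indices_for_range(start: datetime, end: datetime) -> list[str]:
    """Names of every index a query over [start, end] has to search."""
    names = []
    current = slice_start(start)
    while current <= end:
        names.append(index_for(current))
        current += SLICE
    return names


if __name__ == "__main__":
    event_time = datetime(2013, 10, 25, 14, 30, tzinfo=timezone.utc)
    print(index_for(event_time))
    print(indices_for_range(event_time, event_time + timedelta(minutes=95)))
```

One reason this layout suits logs is that old data ages out by dropping whole indices rather than deleting individual documents, and recent slices (which take almost all the writes and most of the queries) can be handled differently from older ones.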
One of the things that showed up during our development process was native Lucene support for numeric values. This let us support much faster numeric search, which was invaluable when dealing with JSON events.
2012 – Loggly Gen 2
In late 2012, we started thinking hard about a new generation search product – the first generation system was forcing us to spend too much time on plumbing and not enough on the product itself. By now, SolrCloud was a reality, and ElasticSearch was a serious competitor. As long-time Solr users, we didn’t make the decision between them lightly – I went into the process pushing for Solr, primarily because we had so much more experience with it. The fact that ES won is a testament to the robustness of the “Elastic” part of its name. In some pretty strenuous torture testing, we were unable to break it in any way that worried us.
Our timing turned out to be pretty close to perfect. Lucene 4 had just been released, so we got all of the performance improvements that came along with that. We also got real NRT, which had been lurking in Lucene for a while. Finally, we also got all of the statistical tools that had been laid out on top of true numerics. The combination of Lucene 4 and all of the ES-specific work on top of Lucene has given us a blazingly fast, scalable, and rock-solid foundation to build on. At Loggly, we’re working hard to surface all of that power in a way that is as easy to use and consume as it is powerful. Loggly is focused on the SERVICE in SaaS and on distributing the insights from the data to solve operational issues, fast. Yes, I’m addicted to search, logging, and all the intricacies involved in continually optimizing them. I think about it almost every waking hour, and I do it non-stop so our customers don’t have to.
2014 – Onward
There are still a ton of hard problems we have to solve. The fact that there are now open-source systems that deal with a lot of the plumbing and make it easy to find and analyze semi-structured data doesn’t mean that the laws of physics have been revoked. Nor the laws of software development. Scaling a system to handle billions of events per day is still complicated, and you need a great team to make it happen. We’re incredibly proud of the quantum leaps we’ve made in terms of scale, robustness, and performance, but there is still a lot of work to do to grow even more.
It’s easy to become jaded as a software developer, but there is something about working in search that keeps that at bay. A big part of it is that no matter how much you think you know, there are still a surprisingly large number of “Oh Really?!” moments once you start actually using search for real problems. The huge volume and variety of data that we deal with means that we’re always seeing people with new and interesting problems. Solving those problems is always fun.
I won’t spoil the punchline of the presentation, and I’m sure some of you will disagree with it, but I hope you’ve enjoyed this journey back in time with me. Hopefully, you’ll have an “Oh Really?!” moment of your own, and come to understand why I’ve enjoyed myself so much for the past decade and a half. This is where I should be extolling the virtues of our awesome system, but it can do the talking all by itself. Even if the only thing you leave here with is a better understanding of the true power of “search”, then my work here is done.
Update from my marketing team: here’s my presentation on SlideShare.