I was one of three speakers at the Lucene/Solr meetup last month, co-sponsored by salesforce and Lucid Imagination. I don’t know how anyone at salesforce with a window gets any work done, considering the view – take a look at Grant’s photo to see what I mean. Thanks to Bill from salesforce for hosting, and the guys at Lucid for organizing things. You can check out the two other talks here, as well as talks from previous meetups.
UPDATE: I’ll be doing a slightly expanded version of this talk at Lucene Revolution in Boston on October 8th, incorporating some of the stuff I talk about below.
I got a few interesting questions and comments after the talk, so I thought I’d expand a bit on what was in my slides, which were perhaps a little dense.
“Log Search is highly skewed”
In the talk, I said that the most important search data is the most recent. When you have a problem, you’re far more likely to care about what happened in the last few minutes or hours or days than what happened a month ago. Thats not say that you’ll never need to search older data, just that most of the time, you won’t.
After the talk, though, it became obvious that I should also have said that our users are likely to use search in a way that is also pretty skewed when compared to “normal” search products. Basically, we expect that most people will use the system somewhat sporadically, but that when they do, its likely to be a pretty intensive session of bug hunting. So instead of a fairly continuous search load, we get random spikes for a small subset of all the data we have in Solr. This is actually good for us, because we don’t need to keep all of the shards for all of our customers “hot” in Solr. When a customer shows up, we can warm their data quickly, and let Solr and the filesystem cache do their thing to deal with shards that haven’t been used for a while.
The most important point here is that the overall system is going to be spending the vast majority of its resources on indexing, rather than searching. I can’t give you numbers, but if we end up spending anything more than about 5-10% of our cycles on search, I’ll be very surprised. This is not your typical consumer search product.
I talked a bit about 0MQ, and said that we chose it primarily because its fast and lightweight, even though its possible that we could lose data if things break. I clarified this a bit in a comment on Sarah Allen’s blog because I want to make sure the message is that 0MQ is awesome, not that it loses data. Here’s the guts of what I said…
I wanted to clarify one point in your writeup, though, to make sure people don’t get the wrong idea about 0MQ. Yes, our implementation of 0MQ has a potential “leak”, where we can lose messages, but its a very uncommon case, and the impact is small. Specifically, if one of the solr nodes dies hard, we potentially lose any events that were sent to it in the last batch (0MQ batches to minimize comms overhead). In steady state, 0MQ is rock solid, 100% reliable, and faaaaaast.
Pieter (at iMatix) and I are currently discussing ways to solve the hard death problem, and I don’t anticipate it being a problem very long. As I said in the talk, 0MQ is unbelievably cool – if you haven’t got a project that needs it, make one up!
We sponsored some work to get the SWAP functionality in version 2 of 0MQ, and I’ve been blown away by the guys at iMatix – they really want 0MQ to work, and work well. My throw-away comment prompted an email from Pieter asking for more details, and, as I said to Sarah above, we’re already looking at how to fix it.
Oh, and in case you’re wondering how fast a one-armed paper-hanger is, take a look at what The Word Detective says about it (scroll down till you see the “You missed a spot” section). Maybe I should have used “flat out like a lizard drinking” instead?
The way we create shards by indexing, then merging, then merging again and again and again raised a few questions that are worth repeating…
To recap, we build small (5 minute) shards on our hot indexers. When we stop adding events to them, they get merged with older shards until we hit another size limit (30 minutes). They then get merged with even older shards, until we hit the next time limit (4 hours). And so on up the chain until they cap out at a week long. Along the way, we push indexes from box to box, to balance the load on the system as a whole.
The first question is fairly obvious: Why?
At first glance, it seems like we’re just creating work for ourselves. Surely we could just build the shards and use them as is, right? The problem is that we would have a lot of 5 minute shards floating around the system, and we already know that Solr starts getting cranky when you run a lot of cores in a single instance. So, why don’t we just build bigger shards? The issue there is that with the version of Solr we’re using, we have to reopen the index to make new data available, and we currently do that every 10 seconds (hence the “ NRT + SolrCloud = Our Nirvana” in my slides). Since we have to do this, we’d end up with too many segments in the hot index, or (if we’re not careful with our merge factor) a lot of automatic merging that means that the hot index becomes unavailable for updates for too long for my liking. So, we got pushed into this approach by something that I’m hoping will soon be a thing of the past. I’m really looking forward to Michael Busch’s talk at Lucene Revolution which promises to remove the “N” from NRT. I’m not sure what is better than nirvana, but I’m hoping to find out soon
We may have been forced into doing things this way, but there is a lot of value in the model we have. In some ways, we’re taking over a part of Lucene (merging) that has been absolutely invaluable, but can sometimes be a little difficult to control. We now have complete control over when and where indexes get merged. I probably should point out that we deliberately don’t do any merging on the 5 minute shards, and that we’re careful with the merge parameters on the larger shards to make the merges that do happen as efficient as possible. The model also gives us a very simple index naming scheme based on time, which means we always know exactly where to find data for a time-constrained query. More on this in a bit…
The next question (from the meetup) was what is the overhead of all this merging?
Rather than give numbers, its worth thinking about whether we’re actually doing anything more than Lucene already does when you start building big indexes. I think the answer to that is that we’re actually just exposing and taking over the automatic behaviour, rather than doing something “extra”. So I think the real overhead is close to zero. Compared to building a bunch of shards in parallel using Hadoop, we’re certainly doing more work, but most of the Hadoop based systems I’ve looked at are geared more towards building indexes from a large existing corpus, rather than dealing with a real time stream.
My final comment on this is that since its all completely configurable, we’re not locked into any of the times I’ve mentioned above. Maybe when we move to NRT, or RT, we can bump the hot shard size up to hours or days, assuming that we’re still in control of merging. We shall see…
Circling back to the first section, where I talked about how skewed we expect our search to be, the time-based shards gives us a very clean way to limit the impact of our search requests. Since we can constrain a search to a specific time period, its easy for us to identify which indexes we need to hit to satisfy the search. Our ideal search is for something in the last few minutes, which can be entirely served out of one or two of the five minute shards. We may have gigabytes or (hopefully) terabytes of index data for the same customer sitting around on our system, but if we can satisfy their request by hitting two small, heavily cached cores, then we’re in great shape. I wonder if life will be so kind to us?
Random aside: Synchronicity
Every now and then, things just come together in strange ways. A couple of weeks ago, Kord and I talked with Diego and Santiago from Flaptor, who are working on IndexTank. Diego and I were at LookSmart together years and years and years ago, but thats not the synchronicity. As we were talking, Diego said they were working on a “Nebulizer” which does automatic distribution of their index in the cloud. The day before the meeting, I’d pulled all of the code that deals with this in our system into a class named “TheDecider” (I’m still wrestling with a way to make misunderestimate() a useful method in this class). That evening I went to a NoSQL meetup, and met someone who is also working on the equivalent for their system. Maybe there is something in the air?