Regexes: With Great Power Comes a Great Downside
One day, you may find yourself wanting to use regexes to parse a large amount of data. You might even want to build a feature that exposes regexes to your users. Regexes are pretty sweet, and who wouldn’t want to enhance their product with the power of regexes?
But wait! There’s a problem. Regexes can be slow. In fact, if you are using a language with a regex engine that uses the recursive backtracking algorithm, including Java, Python, and Perl, it is possible to craft regexes that have exponential performance characteristics. Those of you who have read my previous two blog posts on regexes know why this happens and how to avoid it. Those of you who have not read them, check them out and also read this enlightening article by Russ Cox.
Protect Your Code from Regex Performance Issues
If you use regexes at scale, especially if you expose them to the outside world, you will likely run into performance issues. Whether they are the innocent mistakes of one of your customers or the malefactions of a troll, it is imperative that you build code that is resilient to poorly performing regexes.
At Loggly, we allow our Pro and Enterprise customers to craft their own regexes to parse their own data. These regexes, along with our standard parsing regexes, run in our infrastructure using our own parsing service. This is a textbook example of when handing the regex reins over to customers can lead to Bad Things™.
This post will describe a couple approaches we took at Loggly to protect ourselves from poorly performing regexes, starting with our first, flawed approach, and ending with the solution that we use today. We’ll do a technical deep dive and end by asking the question: Why was our first solution flawed and what learnings from our experience can be applied to robustness problems in general?
Approach #1: Give Slow Regexes Their Own Lanes
At Loggly we’re very fond of building solutions that involve cordoning off problem customers from our main pipeline. This works beautifully for several of our use cases (e.g. an enormous spike of data coming from one of our customers or an alert a customer has created is taking much longer to process than the alerts of other customers). Thus, our first idea to solve the regex pickle was to implement “lanes.” Lanes in this architecture are a small set of queues where some customers’ data is buffered and then processed by separate pools of threads. In our case, the data was being processed by instances of our parsing service. We had a fast lane for the customers who had performant (or no) regexes, and successively slower lanes that handled slower regexes (i.e. lane one buffered the data from all the customers with somewhat slow regexes, while lane two buffered the data for customers with very slow regexes, etc.).
The parsers had a metrics API that helped us determine when a customer needed to be moved from one lane to another lane. A regex could take a long time to parse, but if matched against a very small set of data, we wouldn’t need to be worried about it. So we came up with a formula to approximate the cost of a regex.
The cost of a regex, or latency score (RL), was determined by the regex’s average time to match (RT), number of events matched by the regex (RN), and the total number of events being matched in the lane (E).
The latency ratio is used to approximate the number of cycles the parsers spend parsing any given regex compared to other regexes.
Managing our regex performance would then just be a matter of checking the latency ratios periodically. If a regex’s latency ratio was too high, we would start buffering its customer’s data into a slower lane. So if a customer had a slow regex, only its data would be processed slowly, while the faster lanes zip along like nothing was the matter. Conversely, if a customer’s latency ratio was below a certain threshold, then we could try moving them back to a faster lane. Problem solved, right?
Well, as it turned out, this approach had some problems. First of all, balancing resources across the different thread pools was a tricky but imperative feature. Heading further down this path would have meant more layers of monitoring, more automation, more logic, more code, testing, and bugs. And furthermore, it was at this point that we realized that our solution was not only complex but also fundamentally flawed: If a slow regex were a disease, what we had built was a quarantine, when what we needed was a cure.
Quarantining the regexes doesn’t really help because we have a finite number of cycles overall that we can devote to parsing. Regexes can take vastly varying amounts of cycles to match, so it’s entirely possible that a regex is created that simply requires more cycles than we actually have in order to match it. Putting the regex into another lane would ameliorate the situation, since fewer customers would be impacted by the slow regex, but the problem would remain unresolved as that customer’s data would come in faster than we could process it. We could supposedly just spin up more parsing capacity whenever we reach our limit, but there’s a huge problem with that: Because of the exponential performance characteristics of our regex engine, a single costly regex could exponentially increase the amount of resources required to match it. In such a case, we would have no choice but to disable the offending regex.
Approach #2: Disable Slow Regexes
And thus, it became evident that we needed a way to disable regexes. Okay, simple right? Just measure how long a regex takes to match and if it is too slow, disable it. Actually though, it’s not that straightforward because there are pathological regexes that, when fed a certain input, will take longer to match than the age of the universe. Fun fact: this is known as a Regex Denial of Service Attack and anyone who exposes regexes to the outside world should protect themselves against such an assault. In order to adequately protect yourself, it is imperative that you be able to break out of matching if a thread gets stuck on a single regex for too long.
There are a couple of options at this point, but we ended up adding the ability to interrupt matching after a configured latency threshold had been exceeded. I’ll walk through our whole solution, written in Java. To break out of matching a regex, we extended the CharSequence class. This special CharSequence is initialized with a flag that will indicate if we need to break out of matching the regex or not.
Then, using the RegexInterruptor that was initialized with this particular parsing thread, matching the regex is just a matter of doing the following:
Finally, a separate monitor thread periodically checks all the RegexInterruptor objects to see if we need to break out of parsing:
This feature allows us to have precise control over how long a regex is allowed to take to match. At this point, we have the cure and the parsers can run in a fully automated way without having any performance issues.
Are Both Approaches Still Necessary?
So, now that we have the cure, do we still need the quarantine? We built the ability to disable rules so that we are protected against egregiously slow regexes. Could the lanes perhaps still be useful for regexes that are slow, but not so slow that they exceed our available parsing resources? Do the benefits of the lanes outweigh the cost of the architectural complexity involved in building and maintaining them?
Let’s say a customer creates a regex that is 50 times slower than the average regex (this is all too easy to do with one too many dot-stars). Most of our customers’ data by itself constitutes a small percentage of our total incoming flow of data, so let’s assume the regex applies to .02% of our overall volume. This new regex, which alone is far less efficient than the rest of our regexes, will affect performance, but only slightly. Were we to process this slow regex along with all the other regexes, we increase overall parsing latency for all customers by about 1%. If the customer upgrades her account, however, and now her regex applies to 2% of our overall volume, parsing becomes twice as slow for every customer. Unless we have that much capacity sitting around or we are willing to spin up twice as many parsers, we simply cannot support such a slow-running regex. Therefore, the line between a regex that is too slow to support and not slow enough to make a difference is quite small.
Therefore, lanes are marginally useful in very specific circumstances: A customer must make a rule that is slow enough to affect performance for other customers but not so slow that we can’t handle the hit. In practice, this rarely happens. The lanes architecture required a hefty amount of monitoring and automated resource allocation to be viable, and due to the number of variables that can affect performance, the solution was tricky to craft, not to mention test. So, we stopped using the lanes and now we rely entirely on disabling regexes when they exceed a latency threshold. This solution was trivial to implement compared to the lanes and has been working beautifully for more than a year now. A very small percentage of our customers ever run into the problem and if they do, it’s usually a matter of replacing a .* with a negated character class or some other easy and effective optimization.
An Alternative Approach: Thompson NFA Algorithm
Before I forget to mention it, looking back on all this with my 20/20 retrospective goggles, another good solution might have been to use a different regex engine. As I mentioned earlier, most modern languages have a regex engine that uses the recursive backtracking algorithm for regex matching, which is the algorithm that has the exponential runtime characteristics. The Thompson NFA algorithm, however, runs in linear time regardless of the regex or input, which essentially erases the performance problems entirely. In the early days of developing Derived Fields, I ran some benchmarks against a few different Java-based, NFA-backed regex engines and found that their performance was consistently 20-30% slower than the java.util regex engine. This may simply have been a result of their maturity compared to the standard Java regex engine, which heavily optimizes performance. We could have used a more mature C-based NFA regex engine and used a JNI wrapper, which is a solution we never really explored. NFA-backed regexes lack certain features, however, that many people have come to expect when using regexes so it would have been a tradeoff. I could write a blog post on this topic alone, so suffice it to say that this road might have also been a successful one had we gone down it.
Regex Performance: Just One More Challenge of Operating SaaS at Scale
Robustness in a web service is usually a hard nut to crack. Some of the hardest problems we solve at Loggly involve protecting ourselves from failures, from spikes in volume, and from anomalous usage patterns to DDoS attacks. An important lesson that we learned from this experience is to carefully consider the problem being solved and to ask ourselves: Are we building a quarantine or a cure? Which would be more appropriate given the problem? Also, a solution should never be discounted because it seems too simple. Some solutions for robustness problems involve big, architectural behemoths that, through sheer force of logical complexity, get the job done. But sometimes, the same problem can be solved with an elegant and simple solution that is no less powerful and no less effective.
Elegant, simple, and powerful: Sounds kind of like regexes themselves, non? Now that regex performance need not stop you from using and exposing regexes to the outside world, go forth with confidence and enhance your products with these powerful, simple little tools!