Regexes: The Bad, the Better, and the Best

 

Regexes: Bad, Better, and Best

Tweet This Post Button

A Story about How Just a Few Characters Can Make Such a Big Difference in Performance

Regular expressions are incantations that we developers wield mightily when the time calls. Yet, do we always wield them deftly? Regular expressions are a delicate and precise language. They are crafted with careful deliberation into powerful forces that level text like a perfectly thrown bowling ball, knocking over all 10 pins with an instant and dazzling smash. A regular expression that is naively thrown together is like a drunkard who trips and stumbles over text, clumsily managing to roll the bowling ball down the lane, maybe hitting a pin or two.

What, you ask, would be the difference between these two regular expressions? What is it that makes a good regular expression and a bad one? Well, sit close, and I shall reveal the mechanisms that power these incredible tools. You will see the magnitude of difference between regexes that perform well and ones that don’t, and you shall be the wiser, and your regular expressions the defter.

Before we begin, I should add that this post assumes you have a good understanding of how to construct and use regexes. If you are unfamiliar with regexes, I recommend checking out this website: http://www.rexegg.com/. It contains an in-depth tutorial and plenty of information to get you started. It also describes in detail some advanced techniques, so it’s bookmark-worthy even if you already know a bunch about regexes.

Let’s Compare a Bad Regex and a Better Regex

Consider these two regexes:

.* (.*)\[(.*)\]:.*
[12]\d{3}-[01]\d-[0-3]\d (.*)\[(.*)\]:.*

These regular expressions are written to match text in the following format:

2014-08-26 app[web.1]: 50.0.134.125 - - [26/Aug/2014 00:27:41] "GET / HTTP/1.1" 200 14 0.0005

The important pieces of information being extracted are app and ‘web.1′, i.e., the first and second capture groups.

Which Regex Is Better?

Which regular expression would you say is the bad one and which is the better one? You probably guessed right, so how about a harder question: How much worse would you say the bad one is from the better? What kinds of input would cause the bad regex to perform much worse than the other?

In General, the Longer Regex Is the Better Regex

The second regular expression is indeed better than the first. Why is that? A good indicator is that it is longer. Good regular expressions are often longer than bad regular expressions because they make use of specific characters/character classes and have more structure. This causes good regular expressions to run faster as they predict their input more accurately.

The faster you can throw out non-matching input, the fewer cycles you waste (not to mention, the more precise your regexes are, the less likely you are to accidentally match text that you didn’t mean to match). In the above example, we know that our matching input will always start with a time stamp. We’ve encoded that fact in the better regex, even though we are not extracting any information from the time stamp itself.

Tweet This Post Button

Non-Matching Input Can Take Far Longer to Process Than Matching Input

When you’re benchmarking your regexes, it is important to keep in mind that non-matching input can sometimes take far longer to process than matching input. With matching input, the regex will usually match at some point in the middle of processing. With non-matching input, the regex may need to try many, many different paths before it can rule out the regex as not being a match.

To demonstrate this, let’s walk through what each regex does with the following non-matching input:

50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005

How the Bad Regex Runs Against the Non-Matching Input

The bad regex looks like this, with the main sections color coded:

.* (.*)\[(.*)\]:.*

1. The regex starts with a .* and since the * is greedy, it will first scan the whole event trying to grab as much text as possible.1

50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005

2. It will then backtrack from the end until it reaches the first space. It’s now expecting (.*)\[  which is a (possibly zero length, but as long as possible) string of characters occurring before an open bracket.

50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
                                                                 ⬆

3. It scans 0.0005 and, not having encountered an open bracket, continues backtracking, looking for earlier spaces.

50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
                                                              ⬆

4. This scanning and rescanning continues with each space until it reaches the third space occurring in the event.

50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
                                               ⬆
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
                                             ⬆
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
                                       ⬆
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
                             ⬆
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
                ⬆

5. It has found a zero-length string of characters followed by an open bracket (note that this means that the first capture group is empty). It moves on to match \[.*\]:. There’s another .*, so it consumes the whole event again and backtracks, this time looking for closed brackets.

50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
                                      ⬆

6. It has found a closed bracket but here it fails because it doesn’t see the colon it expected, so it scans back a little more looking for more closed brackets. When it doesn’t find any, it will return to backtracking looking for spaces.

50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005

7. It goes to the second occurring space, backtracks a little, and finally ends up back at the beginning.

50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
            ⬆
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
50.0.134.125 - - [26/Aug/2014 00:27:41] \"GET / HTTP/1.1\" 200 14 0.0005
⬆

8. There is nothing else for it to try, so it finally fails to match.

 

This regex resulted in a lot of wasted effort scanning and rescanning. There are so many things we could have done to stop the regex from processing this much text! I will cover some of them later in this post.

How the Better Regex Runs Against Non-Matching Input

Now, consider the second regular expression. It expects the digits 1 or 2 to begin the input text because we know that our input text should start with a time stamp. The input begins with a 5 so the input is not a match. Done and done.

Performance of the Bad and Better Regexes

Now you can see why the bad regex is bad and why the good regex is good when they are run against non-matching input. The second regular expression was able to throw out the non-matching input immediately, whereas the first regular expression wasted tons of effort in determining that the input was not a match. I ran a small performance test2 that feeds in 1,000,000 lines of the non-matching text to each regular expression in turn.3 The bad regular expression took on average 10,100 milliseconds to process all 1,000,000 lines, while the good regular expression took just 240 milliseconds. So, yeah, the bad regex is a lot worse than the good one. Like 42 times worse!!

How the Bad and Better Regexes Perform Against Matching Text

Okay, so that was non-matching text, but what about matching text? When running the same 1,000,000-line test against the matching text from the beginning of this post, the matching input took the bad regular expression an average of 17,800 milliseconds to process, while the better regular expression took 4,700 milliseconds. That’s pretty good, but we could still do a lot better.

Tweet This Post Button

The Best Regex

Now, let me introduce you to the best4 regular expression:

[12]\d{3}-[01]\d-[0-3]\d ([^ \[]*?)\[([^\]]*?)\]:.*

What Makes the Best Regex Better Than the Better Regex?

This third regex avoids almost all of the unnecessary scanning of text that the first and second regexes do so much of. The way the best regex avoids this is by doing two things:

  1. Including a quantifier, as in, a character that quantifies how much should be consumed by a star. I replaced all but the last * with *?
  2. Specifying what characters to match (or not to match) instead of just a dot. I replaced all but the last . with a negated character class.

While sometimes (rarely) .*’s are appropriate (at the end of regex #3), they are generally never what you want because of the excessive scanning of input text that we saw above. In regex #3, I have changed all but the last * to *? which makes the star lazy. This will cause the star to consume as little text as possible and it will scan from the current index from left to right rather than from the end of the input from right to left.

I have also changed all but the last . to a negated character class, i.e [^ \]]. These two changes, in effect, cause the regex to scan from left to right looking for characters that are not spaces or brackets. This vastly cuts down the amount of scanning the regex does, with matching and nonmatching input.

Performance of the Best Regex

When I reran the test, the best regex took about the same amount of time to match the non-matching input, but the matching input took only on average 800 milliseconds to run, as opposed to 4,700 milliseconds for the better regex and a whopping 17,000 milliseconds for the bad regex. That’s almost a 20-time difference to match the same information from the same input text. Wow.

Moral of the Story

Should you ever find yourself writing a latency-sensitive application that makes very frequent use of regular expressions, you could save yourself a lot of cycles by crafting a regex that is as precise as possible. These particular examples were very limited, considering the infinite breadth of possible regexes and inputs. They were intended to demonstrate the magnitude of difference between well-crafted regexes and naive regexes, between the champion bowler and the drunkard. To suit your own use cases, you will need to carefully consider exactly what it is you are trying to match.

Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be. There are several techniques that are incredibly useful for speeding up your regexes and in the next installment of this blog post, I will delve into each one. Stay tuned!

Tweet This Post Button

Not all regular expression engines work in the way being described here. The step-through in this post demonstrates the (unoptimized) algorithm that Java, Ruby, Perl, Python, and PHP use, which is the recursive backtracking algorithm. Awk and grep use the Thompson NFA algorithm which is in fact significantly faster in almost every way but supports a more limited set of features. This fascinating article describes the two algorithms in detail: https://swtch.com/~rsc/regexp/regexp1.html.
2 The performance test is not intended to be a formal benchmark, just a quick and dirty demonstration. I ran it on my 2014 Macbook with OpenJDK 1.7. The code for the test can be found here: https://github.com/zzbennett/RegexPerfTest/blob/master/Main.java
Each regular expression is compiled beforehand. Compiling your regexes before repeatedly using them also greatly improves performance.
Technically this regex is still not the “best,” as there are even more optimizations that you can make. Those optimizations, in my opinion, greatly complicate the regex and take quite a lot of time and expertise to craft, but they do boost the performance of the regex by a non-trivial amount. The optimizations I describe in this post give by far the greatest bang for your buck performance-wise while maintaining the readability of the regex. If you’d like to learn more about how to optimize even further, check out this post:http://www.rexegg.com/regex-quantifiers.html#explicit_greed.

Elizabeth Bennett, a Denver native, moved to the Bay Area shortly after graduating from Oberlin College with degrees in Computer Science and Bassoon Performance. While most of her days are spent writing Java for Loggly’s backend services, she does manage to spend a fair amount of time knitting and video gaming. Liz hasn’t been knitting much lately because she is the lead engineer for Loggly Derived Fields, announced earlier this week. 


10 comments

  • Jeff R. Allen

    1 year ago

    You do a good job of explaining a tricky thing. But the first thing that came to my mind as I read this is that this problem is not a good fit for regexps. A simple routine that finds the first space, finds the brackets, and extracts the two substrings would be faster than a regexp. When looking at an optimization problem, the first question has to be, “Is there a possibility I should not be doing this AT ALL?”

    • Liz Bennett

      Liz Bennett

      11 months ago

      Jeff,
      Thanks for your careful read of my post. To answer your question, this would certainly be true if this was the only thing you needed a regex to do. In reality, my regex use cases are much more complex. The simplicity of this regex is more so that it is easier to illustrate my points. You make a good point though. I probably wouldn’t use a regex if all I needed to do was extract two values from fairly regularly formatted input text.

  • dmnc

    1 year ago

    Nice article. Found it through your slashdot video interview. But please… “Asterisk”, not “star”. That bit was distracting in an otherwise technical and informative post.

    • Liz Bennett

      Liz Bennett

      11 months ago

      dmnc,
      I appreciate the feedback. It’s very common to refer to asterisks as stars, especially in reference to regular expressions.

  • Online Cop

    1 year ago

    I appreciate this article. In your “best” example, you mentioned changing the .* (greedy) to .*? (lazy) but as you are using negated character groups, lazy matching seems to actually take fewer steps to process.
    https://regex101.com/r/xJ4mO5/1 shows your “worst” regex,
    https://regex101.com/r/xJ4mO5/2 shows your “better” regex,
    https://regex101.com/r/xJ4mO5/3 shows your “best” regex (which uses the lazy matching), and
    https://regex101.com/r/xJ4mO5/4 shows the .*? changed to simply .*
    Notice the number of steps each takes in order to complete the match.

    • Liz Bennett

      Liz Bennett

      1 year ago

      That’s true that the greedy regex uses fewer steps since the lazy regex does a backtrack for each consumed character. I should mention that depending on the use case and depending on your input, the greedy regex will sometimes be faster, so it’s always worth doing a benchmark.

Share Your Thoughts

Top