Quick intro to regexp internals: think DFA

Automata theory is a wonderful field. However abstract it can get, you find it everywhere: from the simplest recognizable languages to the most complex problems, there is often an automaton worth using. Today, I would like to discuss the automata behind regular expressions.

Source: xkcd

For a long time, I used “traditional” regexp engines: mostly java.util.regex and Python's re module, plus grep, sed, and the Lisp engine within Emacs. I use regular expressions every day, mostly for simple use cases (e.g. grep "foo" in my files), but also for powerful string replacements. Because regular expressions ARE powerful. But you probably know they can be painful as well. Let's begin with a first, pathological regexp.
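The embedded snippet did not survive here; based on the description below, a sketch in Python (whose re module backtracks just like Perl's engine) could look like this, with the variable names being my own:

```python
import re
import time

n = 10
text = "a" * n                      # "aaaaaaaaaa"
pattern = "a?" * n + "a" * n        # equivalent to (a?){n}(a){n}

start = time.perf_counter()
assert re.fullmatch(pattern, text)  # matches, but only after heavy backtracking
print(f"n={n}: matched in {time.perf_counter() - start:.4f}s")
```

Each a? can either consume one "a" or not; the engine tries the greedy choice first, and only discovers much later that the trailing a{n} has nothing left to match.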

Well, as you might have noticed, this is not a real-world regexp, but it can easily lead to a real-world issue. First, we build a string "aa…a" of 10 repeated "a" characters. Then the regexp tries to match the a? pattern 10 times, and that matches, but the engine then realizes that the (a){10} statement has nothing left to match, so the Perl engine falls back to recursive backtracking. Now you're lost. This is a horrible case, since the underlying implementation always tries the positive match for each a? first. The engine has at most n such choices to make, which is how you get 2^n complexity. How does it behave with n = 25?

Of course an inefficient regexp can be acceptable when you're just trying to pick out some relevant files, but it becomes a real problem when this kind of regular expression is used for real-time processing (think of filtering website forms, small RPC calls, or batch jobs processing billions of lines of text). In those situations, exponential time is certainly NOT acceptable.

Let's go back to implementations. Most common engines are based on an NFA (Non-deterministic Finite Automaton), a finite state machine where a given input can lead to several possible states (and the machine does not know in advance which one is the right choice), which they explore by backtracking. Perl, Java, Ruby, and Python, for instance, process regular expressions this way. In the worst case this is far slower than a DFA, its deterministic cousin, which processes the input in time linear in the length of the text. Fortunately, the theory tells us that any NFA can be transformed into an equivalent DFA and vice versa.
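To make the NFA idea concrete, here is a minimal sketch (the encoding and state names are mine, not taken from any library) that recognizes the tiny language a*b by advancing a whole set of NFA states in lockstep. Tracking all reachable states at once, instead of guessing one and backtracking, is exactly what keeps the simulation fast:

```python
# A hand-written toy NFA for the language a*b:
# each state maps an input symbol to the set of possible next states.
NFA = {
    0: {"a": {0}, "b": {1}},  # state 0: loop on 'a', move to 1 on 'b'
    1: {},                    # state 1: accepting, no outgoing transitions
}
START, ACCEPTING = {0}, {1}

def nfa_accepts(text: str) -> bool:
    """Track every reachable state at once: O(len(text) * |states|)."""
    states = set(START)
    for ch in text:
        states = {nxt for s in states for nxt in NFA[s].get(ch, ())}
    return bool(states & ACCEPTING)

print(nfa_accepts("aaab"))  # True
print(nfa_accepts("ba"))    # False
```

The subset construction behind the NFA-to-DFA transformation does essentially the same thing, except it precomputes every possible state set ahead of time.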

Here are two examples I picked up from this page to illustrate these concepts: an NFA recognizing a language, and a DFA recognizing the same language.

The famous grep tool, for example, uses a DFA. Want a simple comparison with the previous Perl program?
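The original comparison output is gone; a rough way to reproduce it (GNU grep assumed) could be:

```shell
# Build "aaa…a" (n = 25) and the pathological "a?…a?a…a" pattern,
# then let grep's DFA chew on it:
n=25
text=$(printf 'a%.0s' $(seq "$n"))
pattern="$(printf 'a?%.0s' $(seq "$n"))$(printf 'a%.0s' $(seq "$n"))"
time echo "$text" | grep -cE "$pattern"
```

Where the backtracking version blows up exponentially, grep answers almost instantly even for n = 25.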

Linear time looks far more reasonable. So how can I move from recursive-backtracking NFA engines to DFA implementations in my projects? I've heard about re2. This library uses a DFA implementation and describes itself as a “fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library”. You should easily find bindings (or alternatives) for languages other than C++. This website compares plenty of libraries, and in terms of performance, re2 definitely looks interesting. So why isn't a DFA used by default? Honestly, I don't know if it's the actual reason, but backtracking NFA implementations offer more features than DFAs, such as back references (e.g. you might want to match something like <(.+)>(.*)</(\1)> to extract XML tags, where \1 refers to the matched opening tag).
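Back references are easy to demonstrate with Python's re (the toy <title> input is my own):

```python
import re

# \1 must repeat whatever group 1 captured -- a feature a pure DFA
# cannot express, which is why DFA-based engines reject such patterns.
m = re.match(r"<(.+)>(.*)</\1>", "<title>Regexp internals</title>")
assert m is not None
print(m.group(1))  # title
print(m.group(2))  # Regexp internals
```

Matching \1 requires remembering an unbounded amount of captured text, while a DFA's only memory is its current state.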

If you can’t use a DFA, perhaps you can still take some good practices away from this article. The first one is good to know: there is a common syntax, the atomic group, that disables backtracking inside a group when you’re sure you won’t need it:

(?> pattern)

You might also want to disable capturing thanks to

(?: pattern)

if you just need to know whether your pattern matches anything in your input data: the group still exists for alternation and quantifiers, but the engine saves no capture state. That's particularly relevant with Java and its java.util.regex package. You can find more general good advice on JavaWorld, much of which remains true for other languages.
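For instance, with Python's re (the patterns are my own toy examples), the non-capturing form records no groups at all:

```python
import re

capturing = re.match(r"(foo|bar)baz", "foobaz")
non_capturing = re.match(r"(?:foo|bar)baz", "foobaz")

assert capturing.groups() == ("foo",)   # capture state was saved
assert non_capturing.groups() == ()     # nothing to save or retrieve
```

Both patterns match the same strings; the non-capturing one simply skips the bookkeeping.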


This was just a quick overview of the differences between NFA and DFA approaches to implementing regexps. Many other types of automata have been used over the years to implement pattern recognition. The regexp world is a fantastic universe mixing data structures, algorithms, and formal languages. If you want to learn more about the history and implementations, you can check my slides on GitHub, or have a look at the following resources.



