RE#: High Performance Derivative-Based Regex Matching with Intersection, Complement and Lookarounds

Ian Erik Varatalu, Margus Veanes, Juhan-Peep Ernits

TL;DR

RE# presents a derivative-based, nonbacktracking approach to regular expression matching that adds intersection, complement, and lookarounds while preserving $O(n)$ matching time in the length of the input. The authors formalize the theory with a Lookaround Normal Form and prove a correctness theorem ensuring that the derivative-based matcher yields the same matches as the formal semantics. The implementation targets .NET 9, leverages SIMD and a prefilter, and demonstrates strong empirical performance: on baseline benchmarks RE# achieves roughly 1.7x the throughput of the next fastest engine, and on extended benchmarks it shows orders-of-magnitude improvements for patterns that use the extended features. The work demonstrates that richer regex expressivity can be realized in practical, linear-time matchers, with potential impact on security-sensitive parsing, policy languages, and data processing.

Abstract

We present a tool and theory RE# for regular expression matching that is built on symbolic derivatives, does not use backtracking, and, in addition to the classical operators, also supports complement, intersection and lookarounds. We develop the theory formally and show that the main matching algorithm has input-linear complexity both in theory as well as experimentally. We apply thorough evaluation on popular benchmarks that show that RE# is over 71% faster than the next fastest regex engine in Rust on the baseline, and outperforms all state-of-the-art engines on extensions of the benchmarks often by several orders of magnitude.
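
To make the idea of derivative-based matching with intersection and complement concrete, the following is a minimal sketch of a classical Brzozowski-derivative matcher. It is not the RE# algorithm: it works on plain characters rather than symbolic character sets, omits lookarounds, performs no term simplification or state caching (so it does not have the linear-time guarantee), and only decides whole-string membership. All class and function names (`Cat`, `And`, `Not`, `deriv`, `matches`, `contains`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


# Regex AST. Names are illustrative assumptions, not taken from RE#.
@dataclass(frozen=True)
class Empty:            # matches no string
    pass

@dataclass(frozen=True)
class Eps:              # matches only the empty string
    pass

@dataclass(frozen=True)
class Any:              # matches any single character
    pass

@dataclass(frozen=True)
class Chr:              # matches one specific character
    c: str

@dataclass(frozen=True)
class Cat:              # concatenation
    l: object
    r: object

@dataclass(frozen=True)
class Alt:              # union
    l: object
    r: object

@dataclass(frozen=True)
class And:              # intersection
    l: object
    r: object

@dataclass(frozen=True)
class Not:              # complement
    r: object

@dataclass(frozen=True)
class Star:             # Kleene star
    r: object


def nullable(r):
    """Return True if r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Any, Chr)):
        return False
    if isinstance(r, (Cat, And)):
        return nullable(r.l) and nullable(r.r)
    if isinstance(r, Alt):
        return nullable(r.l) or nullable(r.r)
    if isinstance(r, Not):
        return not nullable(r.r)
    raise TypeError(f"unknown regex node: {r!r}")


def deriv(r, a):
    """Derivative of r with respect to character a."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Any):
        return Eps()
    if isinstance(r, Chr):
        return Eps() if r.c == a else Empty()
    if isinstance(r, Cat):
        step = Cat(deriv(r.l, a), r.r)
        # If the left part can be empty, a may also be consumed by the right part.
        return Alt(step, deriv(r.r, a)) if nullable(r.l) else step
    if isinstance(r, Alt):
        return Alt(deriv(r.l, a), deriv(r.r, a))
    if isinstance(r, And):
        # Intersection: both operands must accept the rest of the input.
        return And(deriv(r.l, a), deriv(r.r, a))
    if isinstance(r, Not):
        # Complement commutes with derivatives.
        return Not(deriv(r.r, a))
    if isinstance(r, Star):
        return Cat(deriv(r.r, a), r)
    raise TypeError(f"unknown regex node: {r!r}")


def matches(r, s):
    """Whole-string match: derive once per character, no backtracking."""
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)


def lit(word):
    """Concatenation of the characters of word."""
    r = Eps()
    for ch in reversed(word):
        r = Cat(Chr(ch), r)
    return r


def contains(word):
    """Regex for 'any string containing word'."""
    any_star = Star(Any())
    return Cat(any_star, Cat(lit(word), any_star))


# Strings that contain "a" AND do NOT contain "bb".
has_a_no_bb = And(contains("a"), Not(contains("bb")))
```

With this sketch, `matches(has_a_no_bb, "cab")` holds, while `"abb"` (contains `bb`) and `"ccc"` (no `a`) are rejected. The point of the example is that intersection and complement need no special machinery in a derivative-based matcher: the derivative simply distributes over `And` and passes through `Not`.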

Paper Structure (2 sections, 1 figure, 2 tables)


Figures (1)

  • Figure 1: Two evaluations from Section \ref{sec:evaluation}. $\mathbf{RE\#}$ is the baseline and the $y$-axis shows relative slowdown on a log scale.