Efficient Matching with Memoization for Regexes with Look-around and Atomic Grouping (Extended Version)

Hiroya Fujinami, Ichiro Hasuo

TL;DR

The paper tackles catastrophic backtracking (ReDoS) in regex matching by designing linear-time backtracking algorithms for extended regexes with look-around and atomic grouping. It adapts memoization-based backtracking to these constructs using NFAs with sub-automata, and extends the range of memoization-table entries to $\mathsf{Success}$ and $\mathsf{Failure}(j)$, where $j$ ranges up to the nesting depth $\nu(\mathcal{A})$. The authors present separate memoized backtracking algorithms for la-NFAs and at-NFAs, prove their correctness and $O(|w|)$ time complexity in the input length $|w|$, and validate the performance gains experimentally on real-world extended regexes. They also survey how often look-around and atomic grouping occur in real regex usage, which supports the practical relevance of the work for ReDoS defenses.

Abstract

Regular expression (regex) matching is fundamental in many applications, especially in web services. However, matching by backtracking -- preferred by most real-world implementations for its practical performance and backward compatibility -- can suffer from so-called catastrophic backtracking, in which the number of backtracking steps becomes super-linear, leading to the well-known ReDoS vulnerability. Inspired by a recent algorithm by Davis et al. that runs in linear time for (non-extended) regexes, we study efficient backtracking matching for regexes with two common extensions, namely look-around and atomic grouping. We present linear-time backtracking matching algorithms for these extended regexes. Their efficiency relies on memoization, much like the algorithm by Davis et al.; we also strive for smaller memoization tables by carefully trimming their range. Our experiments -- we used some real-world regexes with the aforementioned extensions -- confirm the performance advantage of our algorithms.
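
To make the role of memoization concrete, the following is a minimal sketch of memoized backtracking for a plain (non-extended) regex, in the spirit of the approach the abstract attributes to Davis et al. It is not the paper's algorithm and does not handle look-around or atomic grouping. The NFA encoding, the function name `backtrack_match`, and the hand-built automaton `NFA_EXAMPLE` for `(a|a)*b` are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical encoding (not from the paper): each state maps to an ordered
# list of transitions (label, target). A label of None is an epsilon move;
# list order is the backtracking priority.
Transition = Tuple[Optional[str], int]
NFA = Dict[int, List[Transition]]

# Hand-built NFA for (a|a)*b, a classic source of catastrophic backtracking.
#   state 0: loop head, tries the loop body (1) before the exit (2)
#   state 1: two competing 'a' transitions back to the loop head
#   state 2: reads 'b' and moves to the accepting state 3
NFA_EXAMPLE: NFA = {
    0: [(None, 1), (None, 2)],
    1: [("a", 0), ("a", 0)],
    2: [("b", 3)],
    3: [],
}


def backtrack_match(
    nfa: NFA, start: int, accept: int, w: str, memoize: bool = True
) -> Tuple[bool, int]:
    """Whole-string match by backtracking; returns (matched, explored steps).

    With memoize=False the search is only safe for NFAs without epsilon
    loops, such as NFA_EXAMPLE.
    """
    visited: Set[Tuple[int, int]] = set()
    steps = 0

    def run(q: int, i: int) -> bool:
        nonlocal steps
        if memoize:
            # A configuration seen before has either already failed or is
            # still being explored further up the stack, so retrying it
            # cannot produce a new match.
            if (q, i) in visited:
                return False
            visited.add((q, i))
        steps += 1
        if q == accept and i == len(w):
            return True
        for label, target in nfa[q]:
            if label is None:
                if run(target, i):
                    return True
            elif i < len(w) and w[i] == label:
                if run(target, i + 1):
                    return True
        return False

    return run(start, 0), steps


if __name__ == "__main__":
    text = "a" * 12  # no trailing 'b', so every path eventually fails
    print(backtrack_match(NFA_EXAMPLE, 0, 3, text, memoize=False))
    print(backtrack_match(NFA_EXAMPLE, 0, 3, text, memoize=True))
```

With the table enabled, each pair of state and input position is explored at most once, so the step count is bounded by the number of states times $|w| + 1$. Without it, the step count on this input roughly doubles with each additional `a`. The sketch is recursive, so very long inputs would need an explicit stack instead.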

Paper Structure (4 sections, 1 equation, 5 figures)

This paper contains 4 sections, 1 equation, 5 figures.

Figures (5)

  • Figure 1: $\sigma\in\Sigma$
  • Figure 2: $\varepsilon$
  • Figure 3: $r_1 \cdot r_2$
  • Figure 4: $r_1 | r_2$
  • Figure 5: $r^\ast$
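
The five captions are the basic regex constructors: a single character, the empty word, concatenation, alternation, and Kleene star. Assuming the figures illustrate a standard Thompson-style translation from regexes to NFAs (the figures themselves are not reproduced here), the construction can be sketched as follows. The `Builder` class, its method names, and the `accepts` helper are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional, Set, Tuple

Transition = Tuple[Optional[str], int]  # label None is an epsilon move
Fragment = Tuple[int, int]              # (start state, accept state)


@dataclass
class Builder:
    """Builds NFA fragments; every fragment's accept state has no outgoing
    transitions until it is combined into a larger fragment."""

    trans: Dict[int, List[Transition]] = field(default_factory=dict)
    _ids: count = field(default_factory=count)

    def new_state(self) -> int:
        q = next(self._ids)
        self.trans[q] = []
        return q

    def char(self, c: str) -> Fragment:            # sigma
        s, t = self.new_state(), self.new_state()
        self.trans[s].append((c, t))
        return s, t

    def eps(self) -> Fragment:                     # epsilon
        s, t = self.new_state(), self.new_state()
        self.trans[s].append((None, t))
        return s, t

    def concat(self, f1: Fragment, f2: Fragment) -> Fragment:  # r1 . r2
        self.trans[f1[1]].append((None, f2[0]))
        return f1[0], f2[1]

    def alt(self, f1: Fragment, f2: Fragment) -> Fragment:     # r1 | r2
        s, t = self.new_state(), self.new_state()
        self.trans[s] += [(None, f1[0]), (None, f2[0])]  # left branch first
        self.trans[f1[1]].append((None, t))
        self.trans[f2[1]].append((None, t))
        return s, t

    def star(self, f: Fragment) -> Fragment:                   # r*
        s, t = self.new_state(), self.new_state()
        self.trans[s] += [(None, f[0]), (None, t)]  # greedy: body before exit
        self.trans[f[1]].append((None, s))
        return s, t


def accepts(trans: Dict[int, List[Transition]], frag: Fragment, w: str) -> bool:
    """Whole-string acceptance by state-set simulation (used only as a check)."""

    def closure(states: Set[int]) -> Set[int]:
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for label, target in trans[q]:
                if label is None and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    current = closure({frag[0]})
    for ch in w:
        step = {t for q in current for label, t in trans[q] if label == ch}
        current = closure(step)
    return frag[1] in current


if __name__ == "__main__":
    b = Builder()
    # (a|b)*c assembled from the five constructors
    frag = b.concat(b.star(b.alt(b.char("a"), b.char("b"))), b.char("c"))
    print(accepts(b.trans, frag, "abac"), accepts(b.trans, frag, "ab"))
```

Transitions are kept in ordered lists so that a backtracking matcher could use the order as its priority; the set-based `accepts` check ignores that order and only decides membership.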