Improved Extended Regular Expression Matching

Philip Bille, Inge Li Gørtz, Rikke Schjeldrup Jessen

TL;DR

This work addresses extended regular expression matching, where expressions may use intersection and complement in addition to concatenation, union, and star, and it improves both the time and space bounds of prior approaches. It introduces match graphs and a parse-tree clustering framework that, together with Boolean matrix operations and fast matrix multiplication, reduces the dominant $O(n^3 k)$ term to $O(n^\omega k)$ while achieving $O(n^2 \log k / w + n + m)$ space through space-efficient traversal schemes. A key contribution is the generalization to general match graphs and a modular, cluster-based algorithm that can incorporate a black-box matcher for non-extended regular expressions, yielding broad applicability and near-optimal bounds for almost all parameter regimes. The results are relevant to large-scale pattern matching and data processing tasks where extended operators are essential.

Abstract

An extended regular expression $R$ specifies a set of strings formed by characters from an alphabet combined with concatenation, union, intersection, complement, and star operators. Given an extended regular expression $R$ and a string $Q$, the extended regular expression matching problem is to decide if $Q$ matches any of the strings specified by $R$. Extended regular expressions are a basic concept in formal language theory and a basic primitive for searching and processing data. Extended regular expression matching was introduced by Hopcroft and Ullman in the 1970s [Introduction to Automata Theory, Languages and Computation, 1979], who gave a simple dynamic programming solution using $O(n^3m)$ time and $O(n^2m)$ space, where $n$ is the length of $Q$ and $m$ is the length of $R$. Since then, several solutions have been proposed, but few significant asymptotic improvements have been obtained. The current state-of-the-art solution, by Yamamoto and Miyazaki [COCOON, 2003], uses $O(\frac{n^3k + n^2m}{w} + n + m)$ time and $O(\frac{n^2k + nm}{w} + n + m)$ space, where $k$ is the number of intersection and complement operators in $R$ and $w$ is the number of bits in a word. This roughly replaces the $m$ factor with $k$ in the dominant terms of both the space and time bounds of the Hopcroft and Ullman algorithm. We revisit the problem and present a new solution that significantly improves the previous time and space bounds. Our main result is a new algorithm that solves extended regular expression matching in \[O\left(n^\omega k + \frac{n^2m}{\min(w/\log w, \log n)} + m\right)\] time and $O(\frac{n^2 \log k}{w} + n + m) = O(n^2 + m)$ space, where $\omega \approx 2.3716$ is the exponent of matrix multiplication. Essentially, this replaces the dominant $n^3k$ term with $n^\omega k$ in the time bound, while simultaneously improving the $n^2k$ term in the space to $O(n^2)$.
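To make the problem statement concrete, the following is a minimal Python sketch of a simple cubic dynamic program over the parse tree, in the spirit of the classical $O(n^3m)$ approach mentioned above. It is not the algorithm of this paper. The tuple encoding of parse trees and the names `match_graph`, `matches`, `matching_substrings`, and `R_EXAMPLE` are assumptions made for illustration. For each parse-tree node the sketch builds a Boolean matrix `M` with `M[i][j]` true exactly when `q[i:j]` belongs to the language of that node.

```python
# Illustrative sketch only: a naive cubic DP, not the paper's algorithm.
# Parse trees are nested tuples (a hypothetical encoding):
#   ("chr", c), ("cat", S, T), ("or", S, T), ("and", S, T), ("not", S), ("star", S)

def match_graph(node, q):
    """Return M with M[i][j] True iff q[i:j] is in the language of node (i <= j)."""
    n = len(q)
    size = n + 1
    op = node[0]
    if op == "chr":
        m = [[False] * size for _ in range(size)]
        for i in range(n):
            if q[i] == node[1]:
                m[i][i + 1] = True
        return m
    a = match_graph(node[1], q)
    if op == "not":
        # Complement, restricted to valid substrings (i <= j).
        return [[i <= j and not a[i][j] for j in range(size)] for i in range(size)]
    if op == "star":
        # Reflexive transitive closure (Warshall-style), using that edges only go forward.
        m = [[a[i][j] or i == j for j in range(size)] for i in range(size)]
        for mid in range(size):
            for i in range(mid + 1):
                if m[i][mid]:
                    for j in range(mid, size):
                        if m[mid][j]:
                            m[i][j] = True
        return m
    b = match_graph(node[2], q)
    if op == "or":
        return [[a[i][j] or b[i][j] for j in range(size)] for i in range(size)]
    if op == "and":
        return [[a[i][j] and b[i][j] for j in range(size)] for i in range(size)]
    if op == "cat":
        # Concatenation is a Boolean matrix product of the two match graphs.
        m = [[False] * size for _ in range(size)]
        for i in range(size):
            for mid in range(i, size):
                if a[i][mid]:
                    for j in range(mid, size):
                        if b[mid][j]:
                            m[i][j] = True
        return m
    raise ValueError(f"unknown operator: {op}")

def matches(r, q):
    """Decide whether the whole string q matches r."""
    return match_graph(r, q)[0][len(q)]

def matching_substrings(r, q):
    """Non-empty matching substrings as 1-indexed inclusive pairs (start, end)."""
    m = match_graph(r, q)
    n = len(q)
    return [(i + 1, j) for i in range(n) for j in range(i + 1, n + 1) if m[i][j]]

# (not((a|b)*) b) and (a b (b|c)*), written in the tuple encoding above.
A, B, C = ("chr", "a"), ("chr", "b"), ("chr", "c")
R_EXAMPLE = ("and",
             ("cat", ("not", ("star", ("or", A, B))), B),
             ("cat", ("cat", A, B), ("star", ("or", B, C))))

print(matching_substrings(R_EXAMPLE, "cabbabcb"))  # prints [(5, 8)]
```

The concatenation case is where a Boolean matrix product appears, which is the step that fast matrix multiplication can speed up. The sketch stores a full matrix per node and makes no attempt at the word-level parallelism or space savings discussed in the abstract.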

Paper Structure

This paper contains 21 sections, 8 theorems, 3 equations, 5 figures.

Key Result

Theorem 1

Given an extended regular expression $R$ of length $m$ containing $k$ extended operators and a string $Q$ of length $n$, we can solve the extended regular expression matching problem for $R$ and $Q$ in space $O(\frac{n^2 \log k}{w} + n + m) = O(n^2 + m)$ and time \[O\left(n^\omega k + \frac{n^2m}{\min(w/\log w, \log n)} + m\right).\]

Figures (5)

  • Figure 1: Parse tree for the extended regular expression $(\neg((a|b)^*) b)\cap (a b(b|c)^*)$.
  • Figure 2: Thompson's recursive NFA construction. The regular expression $\alpha \in \Sigma \cup \{\epsilon\}$ corresponds to NFA (a). If $S$ and $T$ are regular expressions then $N(ST)$, $N(S|T)$, and $N(S^*)$ correspond to NFAs (b), (c), and (d), respectively. In each of these figures, the leftmost node $\theta$ and rightmost node $\phi$ are the start and the accept nodes, respectively. For the top recursive calls, these are the start and accept nodes of the overall automaton. In the recursions indicated, e.g., for $N(ST)$ in (b), we take the start node of the subautomaton $N(S)$ and identify it with the state immediately to the left of $N(S)$ in (b). Similarly, the accept node of $N(S)$ is identified with the state immediately to the right of $N(S)$ in (b).
  • Figure 3: Example of the dynamic programming algorithm for $Q=cabbabcb$ and $R=(\neg((a|b)^*) b)\cap (a b(b|c)^*)$. The parse tree of $R$ is shown in Figure 1. The match graph for each node $v$ in the parse tree is labeled by the subexpression $R(v)$. The match graph of the root node of the parse tree (bottom) tells us that the only matching substring in $Q$ is $Q[5,8] = abcb$.
  • Figure 4: (a) The clustering of the parse tree of an extended regular expression. The dark gray nodes are the nodes from $P$. (b) The automaton corresponding to cluster $C_3$ in (a).
  • Figure 5: Illustration of the matching paths represented by the four match graphs computed by simulating $A_C$.

Theorems & Definitions (12)

  • Theorem 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Lemma 6
  • ...and 2 more