Improved Extended Regular Expression Matching

Philip Bille; Inge Li Gørtz; Rikke Schjeldrup Jessen

TL;DR

This work addresses extended regular expression matching, where expressions may use intersection and complement in addition to concatenation, union, and star, and improves both the time and space bounds of prior approaches. It introduces match graphs and a parse-tree clustering framework that, combined with Boolean matrix multiplication, reduces the dominant $n^3 k$ term in the running time to $O(n^\omega k)$, while space-efficient traversal schemes bring the space down to $O(n^2 \log k / w + n + m)$. Here $n$ is the length of the string, $m$ is the length of the expression, $k$ is the number of intersection and complement operators, and $w$ is the word size. A key contribution is the extension of the technique to general match graphs, together with a modular, cluster-based algorithm that can use a non-extended regular expression matcher as a black box, yielding near-optimal bounds for almost all parameter regimes. The results are relevant to large-scale pattern matching and data processing tasks where extended operators are essential.

Abstract

An extended regular expression $R$ specifies a set of strings formed by characters from an alphabet combined with concatenation, union, intersection, complement, and star operators. Given an extended regular expression $R$ and a string $Q$, the extended regular expression matching problem is to decide if $Q$ matches any of the strings specified by $R$. Extended regular expressions are a basic concept in formal language theory and a basic primitive for searching and processing data. Extended regular expression matching was introduced by Hopcroft and Ullman in the 1970s [Introduction to Automata Theory, Languages and Computation, 1979], who gave a simple dynamic programming solution using $O(n^3m)$ time and $O(n^2m)$ space, where $n$ is the length of $Q$ and $m$ is the length of $R$. Since then, several solutions have been proposed, but few significant asymptotic improvements have been obtained. The current state-of-the-art solution, by Yamamoto and Miyazaki [COCOON, 2003], uses $O(\frac{n^3k + n^2m}{w} + n + m)$ time and $O(\frac{n^2k + nm}{w} + n + m)$ space, where $k$ is the number of intersection and complement operators in $R$ and $w$ is the number of bits in a word. This roughly replaces the $m$ factor with $k$ in the dominant terms of both the space and time bounds of the Hopcroft and Ullman algorithm. We revisit the problem and present a new solution that significantly improves the previous time and space bounds. Our main result is a new algorithm that solves extended regular expression matching in \[O\left(n^{\omega}k + \frac{n^2m}{\min(w/\log w, \log n)} + m\right)\] time and $O(\frac{n^2 \log k}{w} + n + m) = O(n^2 + m)$ space, where $\omega \approx 2.3716$ is the exponent of matrix multiplication. Essentially, this replaces the dominant $n^3k$ term with $n^{\omega}k$ in the time bound, while simultaneously improving the $n^2k$ term in the space to $O(n^2)$.
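
To make the baseline concrete, the following is a minimal Python sketch in the spirit of the simple dynamic programming approach mentioned above. It is not the algorithm of this paper, and it is not claimed to be Hopcroft and Ullman's exact formulation. The tuple-based parse tree encoding and the names `match_matrix` and `matches` are assumptions made for illustration. For every parse tree node it computes a Boolean table `M` where `M[i][j]` is true exactly when the substring `Q[i:j]` belongs to the language of that node.

```python
# Hypothetical parse tree encoding (an assumption for this sketch):
#   ("eps",)  ("char", c)  ("cat", A, B)  ("or", A, B)
#   ("and", A, B)  ("not", A)  ("star", A)

def match_matrix(node, q):
    """Return M with M[i][j] True iff q[i:j] is in L(node), for 0 <= i <= j <= len(q)."""
    n = len(q)
    size = n + 1
    op = node[0]
    if op == "eps":
        return [[i == j for j in range(size)] for i in range(size)]
    if op == "char":
        return [[j == i + 1 and q[i] == node[1] for j in range(size)]
                for i in range(size)]
    if op == "not":
        a = match_matrix(node[1], q)
        # Complement is taken only over valid substrings (i <= j).
        return [[i <= j and not a[i][j] for j in range(size)] for i in range(size)]
    if op == "star":
        a = match_matrix(node[1], q)
        m = [[i == j or a[i][j] for j in range(size)] for i in range(size)]
        # Rows are filled from the end of q so that m[t][j] is final for t > i.
        for i in range(n, -1, -1):
            for j in range(i + 1, size):
                if not m[i][j]:
                    m[i][j] = any(a[i][t] and m[t][j] for t in range(i + 1, j))
        return m
    if op in ("cat", "or", "and"):
        a = match_matrix(node[1], q)
        b = match_matrix(node[2], q)
        if op == "or":
            return [[a[i][j] or b[i][j] for j in range(size)] for i in range(size)]
        if op == "and":
            return [[a[i][j] and b[i][j] for j in range(size)] for i in range(size)]
        # Concatenation is a Boolean matrix product over the split point t.
        return [[any(a[i][t] and b[t][j] for t in range(i, j + 1))
                 for j in range(size)] for i in range(size)]
    raise ValueError(f"unknown operator: {op!r}")

def matches(node, q):
    """Decide whether the whole string q is in L(node)."""
    return match_matrix(node, q)[0][len(q)]

# Example: strings over {a, b} that do not contain "aa",
# written as (a|b)* intersected with the complement of (a|b)* a a (a|b)*.
ANY = ("star", ("or", ("char", "a"), ("char", "b")))
HAS_AA = ("cat", ANY, ("cat", ("char", "a"), ("cat", ("char", "a"), ANY)))
NO_AA = ("and", ANY, ("not", HAS_AA))
```

With these definitions, `matches(NO_AA, "abab")` is true and `matches(NO_AA, "abaab")` is false. The concatenation and star cases each cost $O(n^3)$ per node in this naive form, which is where the $O(n^3 m)$ bound comes from. Because concatenation is a Boolean matrix product, this is also the step where fast matrix multiplication can lower the exponent from $3$ to $\omega$. The sketch recomputes shared subtrees and stores a full table per node, so it makes no attempt at the space bounds discussed above.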
