Table of Contents
Fetching ...

Regular Expressions with Backreferences: Polynomial-Time Matching Techniques

Markus L. Schmid

TL;DR

This work addresses the intractability of matching regular expressions with backreferences by introducing two principled approaches: bounding the number of simultaneously active variables via active variable degree (avd) to achieve polynomial-time matching in the input word, and enforcing memory-determinism (md-MFA) to attain near-linear data complexity with a polynomial overhead. Both approaches are grounded in memory automata (MFA) and leverage a Thompson-like translation from regex to automata to analyze structure and control nondeterminism. The paper establishes formal definitions (avd, savd, md-MFA) and derives rigorous complexity results, including polynomial-time matching for avd-bounded regex, $O(|w|p(| ||))$-type matching for md-MFA, and $(co)NP$-hardness results for stronger notions and for verifying certain properties (e.g., savd and synchronised-ness). These findings reveal a meaningful trade-off between expressiveness and tractable matching, showing that memory-deterministic variants strictly extend classical regular expressions while enabling efficient algorithms, albeit with nontrivial checks for determinism.

Abstract

Regular expressions with backreferences (regex, for short), as supported by most modern libraries for regular expression matching, have an NP-complete matching problem. We define a complexity parameter of regex, called active variable degree, such that regex with this parameter bounded by a constant can be matched in polynomial-time. Moreover, we formulate a novel type of determinism for regex (on an automaton-theoretic level), which yields the class of memory-deterministic regex that can be matched in time O(|w|p(|r|)) for a polynomial p (where r is the regex and w the word). Natural extensions of these concepts lead to properties of regex that are intractable to check.

Regular Expressions with Backreferences: Polynomial-Time Matching Techniques

TL;DR

This work addresses the intractability of matching regular expressions with backreferences by introducing two principled approaches: bounding the number of simultaneously active variables via active variable degree (avd) to achieve polynomial-time matching in the input word, and enforcing memory-determinism (md-MFA) to attain near-linear data complexity with a polynomial overhead. Both approaches are grounded in memory automata (MFA) and leverage a Thompson-like translation from regex to automata to analyze structure and control nondeterminism. The paper establishes formal definitions (avd, savd, md-MFA) and derives rigorous complexity results, including polynomial-time matching for avd-bounded regex, -type matching for md-MFA, and -hardness results for stronger notions and for verifying certain properties (e.g., savd and synchronised-ness). These findings reveal a meaningful trade-off between expressiveness and tractable matching, showing that memory-deterministic variants strictly extend classical regular expressions while enabling efficient algorithms, albeit with nontrivial checks for determinism.

Abstract

Regular expressions with backreferences (regex, for short), as supported by most modern libraries for regular expression matching, have an NP-complete matching problem. We define a complexity parameter of regex, called active variable degree, such that regex with this parameter bounded by a constant can be matched in polynomial-time. Moreover, we formulate a novel type of determinism for regex (on an automaton-theoretic level), which yields the class of memory-deterministic regex that can be matched in time O(|w|p(|r|)) for a polynomial p (where r is the regex and w the word). Natural extensions of these concepts lead to properties of regex that are intractable to check.

Paper Structure

This paper contains 16 sections, 18 theorems, 6 equations, 7 figures, 1 algorithm.

Key Result

Lemma 2

Given $w \in \Sigma^*$ and $M = (Q, \Sigma, \delta, q_0, F) \in \mathop{\mathrm{\mathsf{MFA}}}\nolimits(k)$ with $|\delta| = \mathop{\mathrm{O}}\nolimits(|Q|)$, we can decide $w \in \mathop{\mathrm{\mathcal{L}}}\nolimits(M)$ in time $|Q||w|^{\mathop{\mathrm{O}}\nolimits(k)}$.

Figures (7)

  • Figure 1: The syntax tree $\mathop{\mathrm{\mathcal{T}}}\nolimits(\alpha)$ for $\alpha = (x\{\mathtt{a} \mathtt{b} \mathop{\mathrm{\vee}}\nolimits \mathtt{c}\} \mathop{\mathrm{\vee}}\nolimits (\mathtt{a}^+ \mathtt{b}^+)) \: (\mathtt{a} \mathop{\mathrm{\vee}}\nolimits \mathtt{b}) \: x^+$.
  • Figure 2: Example $\mathop{\mathrm{\mathsf{MFA}}}\nolimits(2)$.
  • Figure 3: $\mathop{\mathrm{\mathcal{H}}}\nolimits(\alpha)$ for $\alpha = ((z\{\mathtt{a}^+\mathtt{b}\} x\{\mathtt{b}^+\}) \mathop{\mathrm{\vee}}\nolimits (x\{\mathtt{a}^+\}\mathtt{c} x)^+) x ((y\{\mathtt{a}^+\mathtt{b}^+\}y) \mathop{\mathrm{\vee}}\nolimits (u\{\mathtt{c}^+\}\mathtt{a} u)) z x\{\mathtt{a}^+\}\mathtt{b} x$. Every $t \in \mathop{\mathrm{\mathcal{T}}}\nolimits(\alpha)$ is represented by a grey square labelled with $\mathop{\mathrm{\mathsf{type}}}\nolimits(t)$, which contains the nodes $t^{\mathop{\mathrm{\mathsf{in}}}\nolimits}$ and $t^{\mathop{\mathrm{\mathsf{out}}}\nolimits}$ (and $t^{\mathop{\mathrm{\mathsf{m}}}\nolimits}$ if $\mathop{\mathrm{\mathsf{type}}}\nolimits(t) = [\cdot]$). In this way, the picture also implicitly shows $\mathop{\mathrm{\mathcal{T}}}\nolimits(\alpha)$ (which also implicitly determines the omitted edge labels of $\mathop{\mathrm{\mathcal{H}}}\nolimits(\alpha)$).
  • Figure 4: Illustrations for the proof of Thm. \ref{['memdethardnessTheorem']}.
  • Figure 5: $\mathop{\mathrm{\textsc{SyncMembership}}}\nolimits$
  • ...and 2 more figures

Theorems & Definitions (35)

  • Example 1
  • Lemma 2
  • Proposition 3
  • Lemma 4
  • proof
  • Theorem 6
  • proof
  • Theorem 7
  • proof
  • Corollary 8
  • ...and 25 more