Membership Testing for Semantic Regular Expressions

Yifei Huang, Matin Amini, Alexis Le Glaunec, Konstantinos Mamouras, Mukund Raghothaman

TL;DR

This work investigates membership testing for semantic regular expressions (SemREs), which extend classical regular expressions with oracle-driven refinements. It introduces a two-pass approach: a first pass converts the SemRE into a semantic NFA (SNFA) and builds a query graph that encodes the relevant oracle interactions; a second pass uses dynamic programming to evaluate the graph and discharge oracle queries. The authors establish a time bound of $O(|r|^2 |w|^2 + |r| |w|^3)$ with $O(|r| |w|^2)$ oracle calls, show that the bound improves to $O(|r|^2 |w|^2)$ when queries are not nested, and prove an $\Omega(|w|^2)$ lower bound on the number of oracle queries. They relate SemRE membership to triangle finding to argue about hardness, and provide experimental evidence showing substantial throughput gains over a DP baseline and reduced oracle usage. The results highlight practical potential for integrating external information sources into pattern matching while clarifying inherent computational limits.

Abstract

SMORE (Chen et al., 2023) recently proposed the concept of semantic regular expressions that extend the classical formalism with a primitive to query external oracles such as databases and large language models (LLMs). Such patterns can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus in their paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem. First, we present a two-pass NFA-based algorithm to determine whether a string $w$ matches a semantic regular expression (SemRE) $r$ in $O(|r|^2 |w|^2 + |r| |w|^3)$ time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in $O(|r|^2 |w|^2)$ time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline, and incurs a $\approx 2 \times$ overhead over the time needed for interaction with the oracle. Next, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms which are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an $\Omega(|w|^2)$ lower bound on the number of oracle queries necessary to make this determination.
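To make the membership problem concrete, the following is a minimal sketch of a naive, memoized span-based matcher, in the spirit of a dynamic programming baseline. It is not the two-pass SNFA algorithm of the paper. The class names (`Eps`, `AnyChar`, `Char`, `Cat`, `Alt`, `Star`, `Query`), the function `matches`, and the palindrome oracle `toy_oracle` are all hypothetical and introduced only for illustration. The sketch assumes that $r \land \langle q \rangle$ matches a string exactly when $r$ matches it and the oracle answers true for the query $q$ on that same string.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable, Union


@dataclass(frozen=True)
class Eps:            # matches only the empty string
    pass

@dataclass(frozen=True)
class AnyChar:        # Sigma: matches any single character
    pass

@dataclass(frozen=True)
class Char:           # matches exactly the character c
    c: str

@dataclass(frozen=True)
class Cat:            # concatenation: left followed by right
    left: "SemRE"
    right: "SemRE"

@dataclass(frozen=True)
class Alt:            # union: left or right
    left: "SemRE"
    right: "SemRE"

@dataclass(frozen=True)
class Star:           # Kleene star of body
    body: "SemRE"

@dataclass(frozen=True)
class Query:          # body AND <name>: body must match, and the oracle must agree
    body: "SemRE"
    name: str

SemRE = Union[Eps, AnyChar, Char, Cat, Alt, Star, Query]
Oracle = Callable[[str, str], bool]


def matches(r: SemRE, w: str, oracle: Oracle) -> bool:
    """Decide whether w matches r by recursion over spans w[i:j], with memoization."""

    @lru_cache(maxsize=None)
    def go(r: SemRE, i: int, j: int) -> bool:
        if isinstance(r, Eps):
            return i == j
        if isinstance(r, AnyChar):
            return j == i + 1
        if isinstance(r, Char):
            return j == i + 1 and w[i] == r.c
        if isinstance(r, Cat):
            return any(go(r.left, i, k) and go(r.right, k, j) for k in range(i, j + 1))
        if isinstance(r, Alt):
            return go(r.left, i, j) or go(r.right, i, j)
        if isinstance(r, Star):
            # Peel off a non-empty first piece so that the recursion terminates.
            return i == j or any(
                go(r.body, i, k) and go(r, k, j) for k in range(i + 1, j + 1)
            )
        if isinstance(r, Query):
            # Consult the oracle only if the classical part already matches.
            return go(r.body, i, j) and oracle(r.name, w[i:j])
        raise TypeError(f"unknown SemRE node: {r!r}")

    return go(r, 0, len(w))


calls = []  # log of oracle queries, to make the oracle cost visible

def toy_oracle(query: str, text: str) -> bool:
    """Hypothetical oracle: the query 'pal' accepts palindromes."""
    calls.append((query, text))
    return query == "pal" and text == text[::-1]

# Sigma* a <pal>, reading <pal> as shorthand for (Sigma* AND <pal>).
r_pal = Cat(Star(AnyChar()), Cat(Char("a"), Query(Star(AnyChar()), "pal")))

print(matches(r_pal, "babcacb", toy_oracle))  # True: the suffix "bcacb" is a palindrome
print(matches(r_pal, "abc", toy_oracle))      # False: the only candidate suffix is "bc"
```

Because results are memoized per (subexpression, span) pair, this sketch asks the oracle about each span at most once per query node, but it still explores a number of spans that grows with $|w|^2$ for every subexpression.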

Paper Structure

This paper contains 43 sections, 12 theorems, 36 equations, 13 figures, 3 tables.

Key Result

Theorem 6

Pick a SemRE $r$ and let $M_r$ be the SNFA resulting from the construction of Figure 1. For each string $w$, $w \in \llbracket r \rrbracket$ iff $M_r$ accepts $w$.

Figures (13)

  • Figure 1: Recursive construction of the semantic NFA $M_r$ given a SemRE $r$. The states $s_0$ and $s_f$ of $M_{r \land \langle q \rangle}$ are respectively labelled with the open and close query markers for $q$. The formal construction may be found in the appendix.
  • Figure 2: SNFA which accepts strings of the form $\Sigma^* a \langle \text{pal} \rangle$. Assume the query $\text{pal}$ recognizes palindromes. The machine nondeterministically finds an occurrence of the character $a$, and confirms that the subsequent suffix is a palindrome.
  • Figure 3: Two prefix paths $\pi_4$ and $\pi'_4$ from the initial state $s_0$ to the intermediate state $s_2$. The SNFA in question is the one from Figure 2. Both paths correspond to the same string $w_4 = babca$. The suffix path $\pi_3$ moves the machine from $s_2$ to the final state $s_f$ along the string $w_3 = cb$. Notice that the combined path $\pi_4 \pi_3$ is feasible (so the machine accepts $w_4 w_3 = babcacb$) but $\pi'_4 \pi_3$ is not. The SNFA evaluation algorithm therefore needs to track the indices at which queries were opened along each path through the automaton.
  • Figure 4: Examples of query graphs. (a): Corresponding to possible parse trees for the string $w_4 w_3 = babcacb$ according to the SemRE $r_{\text{pal}} = \Sigma^* a \langle \text{pal} \rangle$. We indicate the labels on each node $v$ by writing $\operatorname{idx}(v) \mathrel{:} l(v)$. The path through the $\operatorname{open}$ node on the left is feasible if $🔮(\text{pal}, bcacb) = \mathrm{true}$ and the path on the right is feasible if $🔮(\text{pal}, cb) = \mathrm{true}$. $\llbracket G_1 \rrbracket = \mathrm{true}$ if either of these paths is feasible. (b): Corresponding to the string $w = w_1 w_2 w_3 w_4$ according to the pattern $\Sigma^* \langle q \rangle \Sigma^*$. The unlabelled vertices are all marked $\operatorname{blank}$. Observe how each $\operatorname{open}$ node may be delimited by any subsequently reachable matching $\operatorname{close}$ node. Our query graph construction exploits similar sharing to produce a query graph with only $O(|r| |w|)$ vertices. (c): Query graph with "nested" queries. Corresponding to the string $w = babcbc$ and the SemRE $r_{\text{nest}} = \Sigma^* a (\Sigma^* b \langle q' \rangle) \land \langle q \rangle$. $\llbracket G_3 \rrbracket = \mathrm{true}$ iff the Boolean formula $🔮(q, cbcbc) \land (🔮(q', cbc) \lor 🔮(q', c))$ evaluates to $\mathrm{true}$.
  • Figure 5: (a): The SNFA $M_{q^*}$ accepts strings of the form $(\Sigma^* \land \langle q \rangle)^*$. Let $w = abc$. (b): The query graph $G_{q^*}^{abc}$ describes the four possible ways by which $M_{q^*}$ might accept $w$. Note that edges in the query graph are unlabelled: the red coloured labels $a$, $b$ and $c$ are there simply to hint to the reader that these edges can be thought of as originating from the corresponding $s_2 \to s_2$ transition of the SNFA.
  • ...and 8 more figures
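
The small examples in the captions of Figures 4 and 5 can be reproduced by brute-force enumeration. The sketch below is only an illustration under stated assumptions: the helper names `suffix_queries` and `segmentations` are hypothetical, and reading the "four possible ways" of Figure 5 as the four ways of cutting $abc$ into consecutive non-empty pieces (each piece being one iteration of $\Sigma^* \land \langle q \rangle$) is an interpretation, not something the captions state explicitly.

```python
from typing import List, Tuple


def suffix_queries(w: str, marker: str, query: str) -> List[Tuple[str, str]]:
    """Oracle queries that the pattern  Sigma* marker <query>  may raise on w.

    Every occurrence of `marker` is a candidate split point, and the oracle
    would be asked about the suffix that follows it.
    """
    return [(query, w[i + 1:]) for i, c in enumerate(w) if c == marker]


def segmentations(w: str) -> List[List[str]]:
    """All ways to cut w into consecutive non-empty pieces."""
    if not w:
        return [[]]
    return [
        [w[:k]] + rest
        for k in range(1, len(w) + 1)
        for rest in segmentations(w[k:])
    ]


# Figure 4: the two candidate queries for "babcacb" against Sigma* a <pal>.
print(suffix_queries("babcacb", "a", "pal"))
# [('pal', 'bcacb'), ('pal', 'cb')]

# Figure 5: candidate decompositions of "abc" for (Sigma* AND <q>)*.
for pieces in segmentations("abc"):
    print(pieces)
# ['a', 'b', 'c'], ['a', 'bc'], ['ab', 'c'], ['abc']
```

A string of length $n$ has $2^{n-1}$ such decompositions, which is why enumerating them one by one does not scale and a shared representation of the candidate paths is attractive.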

Theorems & Definitions (19)

  • Remark 5
  • Theorem 6
  • Proof: Proof sketch
  • Theorem 8
  • Theorem 9
  • Lemma 1
  • Lemma 2
  • Theorem 12
  • Lemma 3
  • Lemma 4: Path correspondence for gadgets
  • ...and 9 more