Membership Testing for Semantic Regular Expressions
Yifei Huang, Matin Amini, Alexis Le Glaunec, Konstantinos Mamouras, Mukund Raghothaman
TL;DR
This work investigates membership testing for semantic regular expressions (SemREs), which extend classical regular expressions with a primitive for querying external oracles. It introduces a two-pass approach: a first pass converts the SemRE into a semantic NFA (SNFA) and builds a query graph that encodes the relevant oracle interactions; a second pass uses dynamic programming to evaluate the graph and discharge oracle queries. The authors establish a time bound of $O(|r|^2 |w|^2 + |r| |w|^3)$ with $O(|r| |w|^2)$ oracle calls, show that the procedure runs in $O(|r|^2 |w|^2)$ time when oracle queries are not nested, and prove an $\Omega(|w|^2)$ lower bound on the number of oracle queries. They relate SemRE membership testing to triangle finding to argue about hardness, and report experiments in which the algorithm substantially outperforms a dynamic programming baseline while using fewer oracle queries. The results highlight the practical potential of integrating external information sources into pattern matching while clarifying the inherent computational limits.
Abstract
SMORE (Chen et al., 2023) recently proposed the concept of semantic regular expressions that extend the classical formalism with a primitive to query external oracles such as databases and large language models (LLMs). Such patterns can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus in their paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem. First, we present a two-pass NFA-based algorithm to determine whether a string $w$ matches a semantic regular expression (SemRE) $r$ in $O(|r|^2 |w|^2 + |r| |w|^3)$ time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in $O(|r|^2 |w|^2)$ time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline, and incurs a $\approx 2\times$ overhead over the time needed for interaction with the oracle. Next, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms that are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an $\Omega(|w|^2)$ lower bound on the number of oracle queries necessary to make this determination.
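To make the membership problem concrete, the following is a minimal sketch of a naive, memoized dynamic programming matcher over substrings. It is not the two-pass algorithm described above, and it is not claimed to be the baseline used in the experiments. It assumes that a query node accepts a string exactly when its inner expression matches the string and the oracle answers yes for it. The class names, the helpers `lit` and `any_of`, and the set-based `is_city` oracle (a stand-in for a database or LLM call) are all hypothetical.

```python
import string
from dataclasses import dataclass
from functools import reduce
from typing import Callable


@dataclass(frozen=True)
class Eps:            # matches only the empty string
    pass

@dataclass(frozen=True)
class Char:           # matches a single character
    c: str

@dataclass(frozen=True)
class Cat:            # concatenation
    left: object
    right: object

@dataclass(frozen=True)
class Alt:            # union
    left: object
    right: object

@dataclass(frozen=True)
class Star:           # Kleene star
    body: object

@dataclass(frozen=True)
class Query:          # assumed semantics: body matches AND oracle accepts
    body: object
    oracle: Callable[[str], bool]


def matches(r, w: str) -> bool:
    """Decide whether all of w matches r, memoizing on (node, i, j)."""
    memo = {}

    def go(node, i, j):
        key = (id(node), i, j)
        if key in memo:
            return memo[key]
        if isinstance(node, Eps):
            res = i == j
        elif isinstance(node, Char):
            res = j == i + 1 and w[i] == node.c
        elif isinstance(node, Alt):
            res = go(node.left, i, j) or go(node.right, i, j)
        elif isinstance(node, Cat):
            res = any(go(node.left, i, k) and go(node.right, k, j)
                      for k in range(i, j + 1))
        elif isinstance(node, Star):
            # Empty match, or a non-empty first piece followed by the star.
            res = i == j or any(go(node.body, i, k) and go(node, k, j)
                                for k in range(i + 1, j + 1))
        elif isinstance(node, Query):
            # The oracle is consulted only if the inner expression matches.
            res = go(node.body, i, j) and node.oracle(w[i:j])
        else:
            raise TypeError(f"unknown node: {node!r}")
        memo[key] = res
        return res

    return go(r, 0, len(w))


def lit(s: str):
    """Hypothetical helper: expression matching exactly the string s."""
    return reduce(Cat, [Char(c) for c in s]) if s else Eps()

def any_of(chars: str):
    """Hypothetical helper: expression matching any one character in chars."""
    return reduce(Alt, [Char(c) for c in chars])


# Stand-in oracle; a real one might query a database or an LLM.
KNOWN_CITIES = {"Paris", "Rome"}
def is_city(s: str) -> bool:
    return s in KNOWN_CITIES

# "in " followed by a run of letters that the oracle recognizes as a city.
pattern = Cat(lit("in "), Query(Star(any_of(string.ascii_letters)), is_city))
```

With these definitions, `matches(pattern, "in Paris")` holds while `matches(pattern, "in Pariz")` does not, because the oracle rejects `"Pariz"`. Memoizing on each (subexpression, start, end) triple keeps the sketch polynomial, but it may pose one oracle query per substring per query node, which is the kind of cost the abstract identifies as important to minimize.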
