The Complexity of Aggregates over Extractions by Regular Expressions

Johannes Doleschal, Benny Kimelfeld, Wim Martens

TL;DR

This work analyzes the computational complexity of evaluating aggregates (Count, Sum, Avg, Min, Max, Quantile) over regular document spanners, modeled via regex formulas with capture variables and weighted VSet-automata. It introduces a unified framework with weight-function classes (Constant-Width, Polynomial-Time, Regular) and spanner representations (VSA and unambiguous variants), establishing a spectrum of tractability results and approximation guarantees. The key contributions include (i) a detailed taxonomy showing when exact aggregation can be computed in polynomial time or reduced to DAG path problems, (ii) hardness results (e.g., #P, OptP) and the necessity of unambiguity or restricted weight representations for tractability, and (iii) viable FPRAS-based approaches in specific nonnegative-weight settings and for certain quantile calculations. The findings offer practical guidance for query planning and approximation in information-extraction pipelines, particularly where evaluating full extraction sets is prohibitive but approximate statistics suffice. The paper also presents a compact DAG representation that links spanner evaluation to path problems, enabling scalable aggregation under favorable conditions and highlighting open questions around broader approximation guarantees and real-world deployment.

Abstract

Regular expressions with capture variables, also known as regex-formulas, extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).
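As a rough illustration of the setting described in the abstract, the sketch below extracts spans with a regular expression that has one capture variable and then computes the studied aggregates over the extracted numbers. It is not the paper's method: it materializes every extracted tuple, which is exactly the naive evaluation whose cost the paper analyzes. Several parts are assumptions made for illustration: Python's `re` module stands in for regex formulas (it returns only non-overlapping matches, whereas spanner semantics consider all matches), spans use Python's 0-based half-open offsets, the sample document and the helper names `extract_spans` and `aggregates` are hypothetical, and the quantile uses a nearest-rank convention that may differ from the paper's definition.

```python
import math
import re


def extract_spans(pattern: str, document: str, var: str) -> list[tuple[int, int]]:
    """Return the (start, end) span of capture variable `var` for each match.

    Assumption: spans are Python-style 0-based, half-open offsets.
    """
    return [m.span(var) for m in re.finditer(pattern, document)]


def aggregates(document: str, spans: list[tuple[int, int]], q: float = 0.5) -> dict:
    """Naively aggregate the integers found at the given spans.

    Assumption: the weight of a tuple is the integer value of the captured text,
    and the q-quantile is taken by nearest rank.
    """
    values = sorted(int(document[i:j]) for i, j in spans)
    n = len(values)
    if n == 0:
        return {"count": 0}
    rank = max(1, math.ceil(q * n))  # nearest-rank position, 1-based
    return {
        "count": n,
        "sum": sum(values),
        "avg": sum(values) / n,
        "min": values[0],
        "max": values[-1],
        "quantile": values[rank - 1],
    }


# Hypothetical document; the variable x captures the number of events.
doc = "Paris 12 events, Berlin 7 events, Rome 30 events"
spans = extract_spans(r"(?P<x>\d+) events", doc, "x")
result = aggregates(doc, spans)
```

Here `result` holds all six aggregates for the three extracted numbers. The number of extracted tuples can grow quickly with the document and the number of variables, which is why the paper asks when aggregates can be computed or approximated without enumerating the full extraction set.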

Paper Structure

This paper contains 33 sections, 43 theorems, 89 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Proposition 2.5

The above operators preserve the finiteness of the supports. Therefore, they map $\mathbb{K}$-relations into $\mathbb{K}$-relations.

Figures (7)

  • Figure 1: A document $d$ (top), a span relation $R$ (bottom left) and the corresponding string relation (bottom right).
  • Figure 2: Two example VSet-automata that extract the span relation $R$ on input $d$ as defined in Figure 1. For the sake of presentation, the automata are simplified as follows: Num is a sub-automaton matching anything representing a number (of events) or range, Gap is a sub-automaton matching sequences of at most three words, City and Country are sub-automata matching city and country names respectively. Loc is a sub-automaton for the union of City and Country. All these sub-automata are assumed to be unambiguous.
  • Figure 3: A document $d$ (top), a span relation $R$ (bottom left), a $\mathbb{Q}$-weighted string relation $W$ (bottom middle) and the $\mathbb{Q}$-weighted string relation $W_R$ resulting from $W$, $d$, and $R$ (bottom right).
  • Figure 4: An unambiguous weighted VSet-automaton over the tropical semiring with initial state $q_0$ (with weight $0$) and accepting state $q_5$ (with weight $0$), extracting three-digit natural numbers captured in variable $x$. Recall that, over the tropical semiring, the weight of a run is the sum of all its edge weights.
  • Figure 5: Inclusion structure of our considered weight functions
  • ...and 2 more figures
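
The caption of Figure 4 recalls that, over the tropical semiring, the weight of a run is the sum of its edge weights. A minimal sketch of that arithmetic follows, assuming the usual min-plus convention (semiring addition is `min`, semiring multiplication is `+`). The function names and the edge weights in the example are hypothetical and are not taken from the figure.

```python
import math

# Min-plus (tropical) semiring, assumed convention:
#   semiring "addition" is min, with neutral element +infinity
#   semiring "multiplication" is +, with neutral element 0
ZERO = math.inf
ONE = 0


def t_add(a: float, b: float) -> float:
    """Semiring addition: combine alternative runs by taking the minimum."""
    return min(a, b)


def t_mul(a: float, b: float) -> float:
    """Semiring multiplication: combine consecutive weights by adding them."""
    return a + b


def run_weight(initial: float, edges: list[float], final: float) -> float:
    """Weight of one run: the semiring product of the initial weight, all edge
    weights, and the final weight, which here is an ordinary sum."""
    w = initial
    for e in edges:
        w = t_mul(w, e)
    return t_mul(w, final)


def tuple_weight(run_weights: list[float]) -> float:
    """Weight of an output tuple: the semiring sum (minimum) over its runs.

    For an unambiguous automaton each tuple has a single accepting run,
    so this reduces to that run's weight.
    """
    total = ZERO
    for w in run_weights:
        total = t_add(total, w)
    return total


# Hypothetical edge weights for a single run with initial and final weight 0.
single = run_weight(0, [0, 1, 2, 3, 0], 0)
# Two hypothetical runs producing the same tuple in an ambiguous automaton.
combined = tuple_weight([single, 4])
```

With these inputs `single` is the plain sum of the edge weights and `combined` is the smaller of the two run weights. Unambiguity removes the need for the minimum altogether, which is one reason the unambiguous representations are easier to aggregate over.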

Theorems & Definitions (86)

  • Example 2.1
  • Example 2.2
  • Definition 2.3
  • Example 2.4
  • Proposition 2.5: Green et al. [GreenKT07]
  • Example 2.7
  • Definition 2.8: Weight function
  • Example 2.9
  • Definition 2.10
  • Definition 2.11
  • ...and 76 more