Skyline Operators for Document Spanners

Antoine Amarilli, Benny Kimelfeld, Sébastien Labbé, Stefan Mengel

TL;DR

The paper defines a skyline operator for document spanners: a domination relation, itself expressed as a spanner, is used to filter the extracted mappings down to the maximal ones. It proves that regular spanners are closed under the skyline operator for regular domination rules, whereas core spanners are not. It also shows that the skyline operator can incur an exponential blowup in the number of states and that skyline evaluation can be NP-hard in combined complexity. It introduces UMDSDP as a sufficient condition for hardness and gives a partial dichotomy for variable-inclusion-like rules, identifying tractable and intractable cases. These results clarify the tradeoff between expressiveness and computational cost in skyline-enabled extraction, with implications for schemaless contexts and related spanner formalisms.

Abstract

When extracting a relation of spans (intervals) from a text document, a common practice is to filter out tuples of the relation that are deemed dominated by others. The domination rule is defined as a partial order that varies along different systems and tasks. For example, we may state that a tuple is dominated by tuples which extend it by assigning additional attributes, or assigning larger intervals. The result of filtering the relation would then be the skyline according to this partial order. As this filtering may remove most of the extracted tuples, we study whether we can improve the performance of the extraction by compiling the domination rule into the extractor. To this aim, we introduce the skyline operator for declarative information extraction tasks expressed as document spanners. We show that this operator can be expressed via regular operations when the domination partial order can itself be expressed as a regular spanner, which covers several natural domination rules. Yet, we show that the skyline operator incurs a computational cost (under combined complexity). First, there are cases where the operator requires an exponential blowup on the number of states needed to represent the spanner as a sequential variable-set automaton. Second, the evaluation may become computationally hard. Our analysis more precisely identifies classes of domination rules for which the combined complexity is tractable or intractable.
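
To make the filtering step concrete, here is a minimal sketch of a skyline computed as a post-processing pass over already-extracted mappings. It is not the paper's construction, which compiles the domination rule into the extractor. The names `skyline`, `dominated_by_extension`, and `dominated_by_span_inclusion` are hypothetical. The two domination rules follow the abstract's informal examples: assigning additional attributes, and assigning larger intervals. The encoding of a mapping as a dictionary from variable names to `(start, end)` offsets is likewise an assumption.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]       # assumed encoding: (start, end) offsets in the document
Mapping = Dict[str, Span]    # variable name -> span; a mapping may leave variables unassigned


def dominated_by_extension(m1: Mapping, m2: Mapping) -> bool:
    """m1 is strictly dominated if m2 agrees with m1 everywhere and assigns more variables."""
    return m1 != m2 and all(v in m2 and m2[v] == s for v, s in m1.items())


def dominated_by_span_inclusion(m1: Mapping, m2: Mapping) -> bool:
    """m1 is strictly dominated if m2 has the same variables with spans containing m1's spans."""
    return (
        m1 != m2
        and m1.keys() == m2.keys()
        and all(m2[v][0] <= s[0] and s[1] <= m2[v][1] for v, s in m1.items())
    )


def skyline(
    mappings: List[Mapping],
    dominated: Callable[[Mapping, Mapping], bool],
) -> List[Mapping]:
    """Keep the mappings that no other mapping dominates (naive quadratic filter)."""
    return [m for m in mappings if not any(dominated(m, other) for other in mappings)]


# Illustrative input, not taken from the paper.
extracted: List[Mapping] = [
    {"x": (0, 3)},
    {"x": (0, 3), "y": (4, 7)},
    {"x": (0, 5)},
    {"y": (4, 7)},
]

by_extension = skyline(extracted, dominated_by_extension)
by_inclusion = skyline(extracted, dominated_by_span_inclusion)
```

Under the extension rule, `{"x": (0, 3)}` and `{"y": (4, 7)}` are dropped because the second mapping extends both. Under the span-inclusion rule, only `{"x": (0, 3)}` is dropped, because `{"x": (0, 5)}` assigns a larger interval to the same variable. This sketch compares every pair of mappings after extraction, which is the cost that compiling the rule into the extractor aims to avoid.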

Paper Structure (23 sections, 19 theorems, 12 equations, 1 figure)

Key Result

Lemma 1

For every regular spanner $P$ and every variable set $X$, there is a functional VA defining $P^{[X]}$.

Figures (1)

  • Figure 1: Extracted mappings before and after applying different skyline operators; see Example \ref{ex:skyline}.

Theorems & Definitions (39)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Definition 4
  • Example 5
  • Example 6
  • Example 7
  • Example 8
  • Example 9
  • Definition 10
  • ...and 29 more