A framework for extraction and transformation of documents

Cristian Riveros; Markus L. Schmid; Nicole Schweikardt

TL;DR

The paper studies how to extract information from text and transform it into new documents using a two-phase Extract-Transform (ET) framework: regex multispanners perform the extraction, and polyregular (in particular, linear) string-to-string functions perform the transformation. It shows that linear ET programs are as expressive as nondeterministic streaming string transducers under bag semantics and that they are closed under composition, and it gives an enumeration algorithm for evaluating them with linear-time preprocessing and output-linear delay. Multispanners, together with a unique multiref-word representation, let the authors model and enumerate transformed outputs including duplicates, which are meaningful in the ET pipeline. Together, these results give a formal foundation for compositional extract-transform workflows in information extraction.

Abstract

We present a theoretical framework for the extraction and transformation of text documents. We propose to use a two-phase process where the first phase extracts span-tuples from a document, and the second phase maps the content of the span-tuples into new documents. We base the extraction phase on the framework of document spanners and the transformation phase on the theory of polyregular functions, the class of regular string-to-string functions with polynomial growth. For supporting practical extract-transform scenarios, we propose an extension of document spanners described by regex formulas from span-tuples to so-called multispan-tuples, where variables are mapped to sets of spans. We prove that this extension, called regex multispanners, has the same desirable properties as standard spanners described by regex formulas. In our framework, an Extract-Transform (ET) program is given by a regex multispanner followed by a polyregular function. In this paper, we study the expressibility and evaluation problem of ET programs when the transformation function is linear, called linear ET programs. We show that linear ET programs are equally expressive as non-deterministic streaming string transducers under bag semantics. Moreover, we show that linear ET programs are closed under composition. Finally, we present an enumeration algorithm for evaluating every linear ET program over a document with linear time preprocessing and constant delay.
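The two-phase process described in the abstract can be illustrated with a small sketch. This is not the paper's construction: Python's `re` module stands in for a regex multispanner, and the input format (`name: item, item`), the variable names `x` and `y`, and the functions `extract`, `transform`, and `run_et` are all hypothetical choices made for illustration. The sketch shows a variable mapped to a set of spans (a multispan-tuple), a transformation that uses each extracted piece once, and a result collected as a bag so that duplicate outputs are counted rather than merged.

```python
import re
from collections import Counter

# A span is a half-open interval [start, end) of positions in the document.
Span = tuple[int, int]


def extract(doc: str) -> list[dict[str, set[Span]]]:
    """Hypothetical extraction phase: one multispan-tuple per input line.

    Variable "x" gets the single span of the name; variable "y" gets the
    set of spans of all items on that line.
    """
    results = []
    for line in re.finditer(r"^(\w+):(.*)$", doc, flags=re.MULTILINE):
        items_start = line.start(2)
        item_spans = {
            (items_start + m.start(), items_start + m.end())
            for m in re.finditer(r"\w+", line.group(2))
        }
        results.append({"x": {line.span(1)}, "y": item_spans})
    return results


def transform(doc: str, t: dict[str, set[Span]]) -> str:
    """Hypothetical transformation phase: each extracted piece is copied once."""
    xs, xe = next(iter(t["x"]))
    items = [doc[s:e] for s, e in sorted(t["y"])]
    return f"{doc[xs:xe]} -> {'; '.join(items)}"


def run_et(doc: str) -> Counter:
    """Run extraction then transformation, keeping duplicates (bag semantics)."""
    return Counter(transform(doc, t) for t in extract(doc))
```

For the document `"ann: x, y\nbob: z\nann: x, y"`, the first and third lines yield different multispan-tuples but the same output string, so `run_et` reports that string with multiplicity 2. That is the sense in which duplicates carry information under bag semantics.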

Paper Structure (42 sections, 13 theorems, 47 equations, 2 figures, 3 algorithms)

Introduction
Further related work
Multispanners
Multispans and multispanners
Representing multispans by multiref-words
Regex multispanners
Comparison with classical spanners
Extract transform framework
A unique multiref-word representation
Extract-transform programs
Polyregular ET programs
Linear ET programs
Deterministic streaming string transducers
Expressiveness of linear extract transform programs
Nondeterministic streaming string transducers
...and 27 more sections

Key Result

Theorem 4.1

Given a regex multispanner $E$ over $\Sigma$ and $\mathcal{X}$ (represented by a multispanner-expression $r$), and a linear polyregular string-to-string function $T$ with input alphabet $\Sigma \cup \Gamma_{\mathcal{X}}$ (represented by a DSST with $h$ states), we can construct an

Figures (2)

Figure 1: Nodes $\textsf{s}_1, \ldots, \textsf{s}_6$ are the six possible cases for a node to be safe, where we assume that the rounded node labeled by $\textsf{m}$ is any node such that $\textsf{odepth}_\mathcal{D}(\textsf{m}) \leq 2$. We use $\gamma$ and $\gamma'$ for a relabel and $\sigma$ for a non-relabeling assignment. Solid arrow is used for the $\ell$-edge and dashed arrow is used for the $r$-edge. Also, $\textsf{s}_1$ and $\textsf{s}_2$ are safe output-nodes, and $\textsf{s}_3$ and $\textsf{s}_4$ are safe union-nodes.
Figure 2: On the left, the structure of input nodes $\textsf{n}_1$ and $\textsf{n}_2$ and, on the right, the node $\textsf{n}'$ that represents $\textsf{union}(\textsf{n}_1, \textsf{n}_2)$. In this construction, nodes $\textsf{n}_1'$, $\textsf{n}_2'$, $\textsf{m}_1$, $\textsf{m}_2$ are reused and all other nodes are fresh nodes.

Theorems & Definitions (25)

Example 2.1
Example 2.2
Example 2.3
Example 3.1
Example 3.2
Theorem 4.1
Theorem 4.2
Theorem 5.1
Proposition 5.2
Proposition 5.3
...and 15 more
