Improving Unstructured Data Quality via Updatable Extracted Views

Besat Kassaie, Frank Wm. Tompa

TL;DR

The paper addresses improving data quality in unstructured documents by treating information extraction as an updatable view mechanism. It formalizes extracted relations as materialized views and defines stability for rule-based extractors, along with verification and a translation framework to propagate view updates back to source documents. By enabling cleaning to occur on extracted views with guarantees that updates translate predictably to the source, the approach supports robust, end-to-end data quality improvements. An empirical study on de-identified medical records demonstrates practical gains in data quality through updatable extracted views, establishing both a theoretical foundation and a concrete pipeline for integrating unstructured data cleaning into downstream tasks.

Abstract

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring data quality. We propose leveraging information extraction algorithms to design, apply, and explain data cleaning processes for documents. Specifically, for a simple document update model, we identify and verify a set of sufficient conditions for rule-based extraction programs to qualify for inclusion in our document cleaning framework. Through experiments conducted on medical records, we demonstrate that our approach provides an effective framework for identifying and correcting data quality problems, thereby highlighting its practical value in real-world applications.

Paper Structure

This paper contains 28 sections, 12 theorems, 16 equations, 6 figures, 2 tables, 2 algorithms.

Key Result

Theorem 5

Consider a stable extractor $\mathcal{X}$, any indexed set of domain-preserving functions $\mathcal{F}=\{f_i \mid f_i : W_i \to W_i, \text{ where } i \in [ 1 \dots \mathcal{T} ] \}$, and any document $D$. For all $i \in [ 1 \dots \mathcal{T} ]$ and $r \in R$, substituting $f_i(v_i)$ for $v_i$ in $[a_i,b_i \ra

Figures (6)

  • Figure 1: Applying extractor $E$ to document $D$ produces table $V$. Updating $V$ to form $V'$, mapping the update back to form $D'$, and then re-applying $E$ should produce that same updated table $V'$.
  • Figure 2: Extraction system that supports updates to source documents as well as extracted views.
  • Figure 3: A sample input document and its updated version, with associated offsets indicated beneath each character, starting from $1$. The substrings in red undergo updates and those highlighted in green represent new values.
  • Figure 4: Extracted relation and its updated version for motivating example.
  • Figure 5: Overview of Proposed Document Cleaning Framework.
  • ...and 1 more figure
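
The round-trip property described in the caption of Figure 1 can be illustrated with a minimal sketch. It is not the paper's implementation: the regular-expression rule, the function names, and the sample text are all hypothetical, and offsets here are 0-based rather than starting from $1$ as in Figure 3. The sketch extracts a view, applies an update function to the extracted values, writes the new values back into the document at the recorded offsets, and checks that re-extraction reproduces the updated view.

```python
import re
from typing import Callable, List, Tuple

# One extracted row: (start offset, end offset, value); 0-based, end exclusive.
Row = Tuple[int, int, str]

# Hypothetical extraction rule: capture the digits that follow "Age: ".
AGE_RULE = re.compile(r"Age: (\d+)")


def extract(document: str) -> List[Row]:
    """Apply the rule to the document and return the extracted view."""
    return [(m.start(1), m.end(1), m.group(1)) for m in AGE_RULE.finditer(document)]


def update_view(view: List[Row], f: Callable[[str], str]) -> List[Row]:
    """Apply an update function to every extracted value, keeping the spans."""
    return [(start, end, f(value)) for start, end, value in view]


def translate_to_source(document: str, updated_view: List[Row]) -> str:
    """Write updated values back into the document at their original spans.

    Spans are processed right to left so that earlier offsets stay valid
    even when a replacement changes the length of the text.
    """
    for start, end, value in sorted(updated_view, reverse=True):
        document = document[:start] + value + document[end:]
    return document


def round_trip_holds(document: str, f: Callable[[str], str]) -> bool:
    """Check that re-extracting from the updated document yields the updated view."""
    updated = update_view(extract(document), f)
    new_document = translate_to_source(document, updated)
    return [v for _, _, v in extract(new_document)] == [v for _, _, v in updated]


if __name__ == "__main__":
    doc = "Patient A. Age: 45. Patient B. Age: 130."
    # An update that keeps values inside the rule's domain (digit strings).
    print(round_trip_holds(doc, lambda v: str(min(int(v), 120))))
    # An update that leaves the domain: the rule no longer matches the new text.
    print(round_trip_holds(doc, lambda v: "unknown"))
```

In this toy setting the check succeeds for the capping function, which maps digit strings to digit strings, and fails for the second function, whose output the rule can no longer extract. That contrast is only meant to suggest why a condition such as domain preservation appears in the statement of Theorem 5; it does not reproduce the paper's definition of stability.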

Theorems & Definitions (24)

  • Definition 1
  • Example 2
  • Definition 3
  • Definition 4
  • Theorem 5
  • Definition 6
  • Lemma 7
  • Example 8
  • Lemma 9
  • Example 10
  • ...and 14 more