Improving Unstructured Data Quality via Updatable Extracted Views

Besat Kassaie; Frank Wm. Tompa

TL;DR

The paper improves data quality in unstructured documents by treating information extraction as an updatable view mechanism. It formalizes extracted relations as materialized views, defines stability for rule-based extractors, and pairs this with verification and a translation framework that propagates view updates back to the source documents. Cleaning can therefore be performed on the extracted views, with guarantees that each update translates predictably to the source. An empirical study on de-identified medical records demonstrates practical gains in data quality, giving both a theoretical foundation and a concrete pipeline for integrating unstructured data cleaning into downstream tasks.

Abstract

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring data quality. We propose leveraging information extraction algorithms to design, apply, and explain data cleaning processes for documents. Specifically, for a simple document update model, we identify and verify a set of sufficient conditions for rule-based extraction programs to qualify for inclusion in our document cleaning framework. Through experiments conducted on medical records, we demonstrate that our approach provides an effective framework for identifying and correcting data quality problems, thereby highlighting its practical value in real-world applications.
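To make the idea of an updatable extracted view concrete, the following is a minimal Python sketch, not the authors' formalism or implementation. It assumes a single hypothetical regular-expression rule (`DOSE_RULE`), a hypothetical `ViewRow` record that keeps the source span of each extracted value, and an update model in which a view update replaces exactly that span in the document. The helper `update_is_reflected` is only an informal stand-in for the paper's stability conditions: it checks that re-running the extractor on the edited document yields the same view with just the one value changed.

```python
import re
from dataclasses import dataclass

# Hypothetical extraction rule (an assumption for illustration):
# a dosage written as "Dose: <digits> mg".
DOSE_RULE = re.compile(r"Dose:\s*(\d+)\s*mg")


@dataclass(frozen=True)
class ViewRow:
    value: str  # extracted value
    start: int  # start offset of the value in the source document
    end: int    # end offset (exclusive)


def extract(doc: str) -> list[ViewRow]:
    """Materialize the extracted view, keeping the source span of each value."""
    return [ViewRow(m.group(1), m.start(1), m.end(1))
            for m in DOSE_RULE.finditer(doc)]


def apply_view_update(doc: str, row: ViewRow, new_value: str) -> str:
    """Translate an update of one view row into an in-place document edit."""
    return doc[:row.start] + new_value + doc[row.end:]


def update_is_reflected(doc: str, index: int, new_value: str) -> bool:
    """Informal check: after the edit, re-extraction returns the same view
    with only the row at `index` changed to `new_value`."""
    before = extract(doc)
    after = extract(apply_view_update(doc, before[index], new_value))
    expected = [new_value if i == index else row.value
                for i, row in enumerate(before)]
    return [row.value for row in after] == expected


doc = "Patient A. Dose: 500 mg daily. Patient B. Dose: 20 mg daily."
view = extract(doc)
cleaned = apply_view_update(doc, view[0], "50")
```

In this sketch, replacing `500` with `50` keeps the view consistent with the document, whereas replacing it with a value such as `fifty` would make the rule stop matching, so the row would vanish from the re-extracted view. That second case is the kind of update that a stability condition on the extractor is meant to rule out.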
