Improving Unstructured Data Quality via Updatable Extracted Views
Besat Kassaie, Frank Wm. Tompa
TL;DR
The paper addresses data quality in unstructured documents by treating information extraction as an updatable view mechanism. It formalizes extracted relations as materialized views, defines stability for rule-based extractors, and pairs this with verification and with a translation framework that propagates view updates back to the source documents. Because cleaning is performed on the extracted views, with guarantees that updates translate predictably to the source, the approach supports end-to-end data quality improvement. An empirical study on de-identified medical records demonstrates practical gains in data quality through updatable extracted views, establishing both a theoretical foundation and a concrete pipeline for integrating unstructured data cleaning into downstream tasks.
Abstract
Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which makes it difficult both to process effectively and to ensure its quality. We propose leveraging information extraction algorithms to design, apply, and explain data cleaning processes for documents. Specifically, for a simple document update model, we identify and verify a set of sufficient conditions for rule-based extraction programs to qualify for inclusion in our document cleaning framework. Through experiments conducted on medical records, we demonstrate that our approach provides an effective framework for identifying and correcting data quality problems, highlighting its practical value in real-world applications.
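As a rough illustration of cleaning a document through an extracted view, the sketch below materializes a view with a single regular-expression rule, translates an update of one view row into a replacement of the corresponding span in the document, and then re-extracts to see whether the update is reflected. Everything here is an assumption made for illustration: the dosage rule, the sample sentence, and the names `extract`, `apply_view_update`, and `update_is_reflected` are hypothetical, and the final check is only an informal stand-in, not the paper's sufficient conditions or its update model.

```python
import re
from typing import List, NamedTuple

# Hypothetical extraction rule: the number in a dosage mention such as "20 mg".
DOSE_RULE = re.compile(r"\b(\d+)\s*mg\b")


class Row(NamedTuple):
    start: int  # span start offset in the document
    end: int    # span end offset (exclusive)
    value: str  # extracted text


def extract(doc: str) -> List[Row]:
    """Materialize the extracted view: one row per match of the rule."""
    return [Row(m.start(1), m.end(1), m.group(1)) for m in DOSE_RULE.finditer(doc)]


def apply_view_update(doc: str, row: Row, new_value: str) -> str:
    """Translate an update of one view row into a span replacement in the document."""
    return doc[:row.start] + new_value + doc[row.end:]


def update_is_reflected(doc: str, index: int, new_value: str) -> bool:
    """Informal check (an assumption, not the paper's definition): re-extracting
    from the updated document yields the old view with only the chosen value changed."""
    before = extract(doc)
    after = extract(apply_view_update(doc, before[index], new_value))
    expected = [new_value if i == index else r.value for i, r in enumerate(before)]
    return [r.value for r in after] == expected


doc = "Patient took 200 mg of drug A and 5 mg of drug B."
view = extract(doc)
cleaned = apply_view_update(doc, view[0], "20")
```

In this toy setting, replacing `200` with `20` is reflected in the re-extracted view, whereas replacing it with a non-numeric string would make the rule stop matching and the row would disappear. That second case is the kind of unpredictable update translation that conditions on the extraction program are meant to rule out.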
