Automatic Data Repair: Are We Ready to Deploy?

Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Jianwei Yin

TL;DR

This work investigates automatic data repair to improve data quality for downstream analysis in the era of generative AI. It introduces a guided information-based taxonomy of 12 repair algorithms (constraint-driven, data-driven, and hybrid-driven), a novel Error Drop Rate (EDR) metric, and a unified optimization strategy that leverages error detection to prevent incorrect repairs. Through extensive experiments on five real datasets and four downstream tasks, the paper reveals that data repair often reduces errors and can boost downstream performance, but many methods may worsen data quality, especially on semantic errors; Baran and MLNClean emerge as particularly effective in different contexts. The findings yield deployment guidelines across multiple scenarios and tasks, while outlining challenges and promising directions, including combining rule discovery with data information and exploring LLM-based repair candidates.

Abstract

Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from generative models. The study of repairing erroneous data has therefore gained significant importance. Existing data repair algorithms differ in the information they use and in their problem settings, and they have been tested only in limited scenarios. In this paper, we first compare and summarize these algorithms using a new guided information-based taxonomy. We then conduct a comprehensive evaluation of 12 mainstream data repair algorithms under various data error rates, error types, and downstream analysis tasks, assessing their error reduction performance with a novel metric. We also develop an effective, unified repair optimization strategy that substantially benefits state-of-the-art methods, as confirmed empirically. We demonstrate that purely clean data does not necessarily yield the best performance in data analysis tasks, and that data is always worth repairing regardless of the error rate. Based on these observations and insights, we provide practical guidelines for 5 scenarios and 2 main data analysis tasks. We anticipate that this paper will help researchers and users understand and deploy data repair algorithms in practice. Finally, we outline research challenges and promising future directions in the data repair field.
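
The TL;DR names an Error Drop Rate (EDR) metric and an optimization strategy that uses error detection to prevent incorrect repairs, but this page does not give their definitions. The following is only a minimal sketch under stated assumptions: EDR is taken here to be the relative drop in the number of erroneous cells after repair, and the optimization is modeled as accepting a repair only on cells that a detector flagged. The function names, the cell-keyed dictionary layout, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, str]      # hypothetical cell key: (row id, column name)
Table = Dict[Cell, str]     # hypothetical table layout: cell -> value


def count_errors(table: Table, clean: Table) -> int:
    """Count cells whose value differs from the ground-truth table."""
    return sum(1 for cell, truth in clean.items() if table.get(cell) != truth)


def error_drop_rate(dirty: Table, repaired: Table, clean: Table) -> float:
    """ASSUMED definition: relative drop in error count after repair.

    The result is negative when a repair introduces more errors than it fixes.
    """
    before = count_errors(dirty, clean)
    if before == 0:
        raise ValueError("the dirty table contains no errors")
    after = count_errors(repaired, clean)
    return (before - after) / before


def keep_detected_repairs(dirty: Table, repaired: Table, detected: Set[Cell]) -> Table:
    """ASSUMED guard: accept a repair only on cells flagged by an error detector."""
    return {
        cell: (repaired.get(cell, value) if cell in detected else value)
        for cell, value in dirty.items()
    }


# Toy data (made up for illustration): two true errors in the dirty table.
clean = {(0, "city"): "Paris", (0, "zip"): "75001",
         (1, "city"): "Rome", (1, "zip"): "00100"}
dirty = {(0, "city"): "Pariss", (0, "zip"): "75001",
         (1, "city"): "Rome", (1, "zip"): "0010"}
# The repair fixes one error, misses one, and breaks a cell that was correct.
repaired = {(0, "city"): "Paris", (0, "zip"): "75001",
            (1, "city"): "Roma", (1, "zip"): "0010"}
detected = {(0, "city"), (1, "zip")}   # cells flagged by a hypothetical detector

guarded = keep_detected_repairs(dirty, repaired, detected)
print(error_drop_rate(dirty, repaired, clean))  # 0.0
print(error_drop_rate(dirty, guarded, clean))   # 0.5
```

In this toy case the unguarded repair leaves the error count unchanged because it corrupts a cell that was already correct, while the guarded version discards that change and halves the error count. This mirrors, in a simplified way, the observation that repair methods can worsen data quality when they alter clean cells.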

Paper Structure (18 sections, 1 equation, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Workflow of constraint-driven/hybrid-driven repair algorithms.
  • Figure 2: Workflow of data-driven repair algorithms.
  • Figure 3: Performance of classification, kNN, and clustering on repaired data.
  • Figure 4: Optimization gain ranking.
  • Figure 5: Data repair performance vs. different error rates.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Definition 1
  • Definition 2
  • Definition 3
  • Example 1
  • Definition 4
  • Definition 5