Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models

Xinyuan Liu, Jiahui Chen, Bocheng Hu, Yu Sun, Xinyang Chen, Shaoxu Song

TL;DR

TableEG introduces instruction-tuned LLMs to generate authentic errors in tabular data via a triplet $(I, T, O)$ framework, trained on 12 real-world datasets to reflect diverse error types and distributions. It uses a four-stage pipeline (Prompt Builder, Trainer, Error Generator, Evaluator) with LoRA-based fine-tuning of LLaMA3.1-8B to produce realistic errors across outliers, missing values, rule violations, and pattern violations. The authors provide a rigorous evaluation framework with Error Pattern Alignment ($S_{EPA}$), column-wise distribution metrics ($J_{col}^{w}$), and distribution similarity ($D_{JS}$), showing TableEG outperforms BART and unguided GPT-3.5 in realism and remains effective for downstream error detection. The work delivers a practical benchmark for data-cleaning research and highlights future directions for adaptive, domain-aware error generation that reduces reliance on predefined constraints.

Abstract

Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we introduce TableEG, a framework that leverages large language models (LLMs) to generate authentic errors. By employing a table fine-tuning strategy and a triplet representation $(I, T, O)$ to model error generation, detection, and correction tasks, TableEG captures the complex dependencies inherent in two-dimensional tables. Trained on 12 real-world datasets spanning 10 diverse domains, TableEG ensures that the synthesized errors faithfully reflect authentic error distributions. Experimental results indicate that errors generated by TableEG exhibit superior pattern and distribution similarity compared to both rule-based methods and LLM-generated errors without fine-tuning. Furthermore, performance metrics on TableEG-generated errors closely align with those on real-world errors across nearly all datasets and detection algorithms, particularly for machine learning based detection techniques. Overall, TableEG not only bridges the gap between synthetic and real-world errors but also establishes a robust benchmark for subsequent error detection and correction tasks.
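The "distribution similarity" mentioned above can be illustrated with a small sketch. It assumes that similarity is measured with the Jensen-Shannon divergence between the error-type distributions of real and generated errors; the paper's exact definition is not reproduced here, and the function names and sample labels below are hypothetical.

```python
import math
from collections import Counter


def error_type_distribution(labels):
    """Turn a list of error-type labels into a probability distribution (dict)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    p and q map category -> probability. The result lies in [0, 1]:
    0 for identical distributions, 1 for distributions with disjoint support.
    """
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        # Kullback-Leibler divergence KL(a || b); zero-probability terms contribute 0.
        return sum(
            a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0
        )

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical labels, using the four error types named in the TL;DR.
real = ["missing", "missing", "outlier", "rule", "pattern", "pattern"]
generated = ["missing", "outlier", "outlier", "rule", "pattern", "pattern"]

score = js_divergence(
    error_type_distribution(real), error_type_distribution(generated)
)
print(f"JS divergence between real and generated error types: {score:.4f}")
```

A lower value means the generated errors follow the real error-type mix more closely. The same function could be applied per column or per value distribution, depending on how the comparison is set up.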

Paper Structure

This paper contains 39 sections, 7 equations, 4 figures, 4 tables, and 5 algorithms.

Figures (4)

  • Figure 1: An example of the errors generated by BART (Arocena et al., 2015) and our TableEG, over the Movie dataset.
  • Figure 2: Overview of training and utilizing our TableEG for error generation.
  • Figure 3: Comparison of $S_{EPA}$ ($k=20$) among TableEG, BART, and GPT-3.5 (Turbo) across different datasets.
  • Figure 4: Impact of $k$ on $S_{EPA}$ for TableEG, BART, and GPT-3.5 (Turbo).

Theorems & Definitions (2)

  • Definition 1: Data Error
  • Definition 2: Triplet Representation