Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

Tommaso Bendinelli, Artur Dox, Christian Holz

TL;DR

The paper investigates whether Large Language Models (LLMs) can improve model performance by cleaning training data without altering the training pipeline. It introduces a simple, iterative framework in which an LLM interacts with a corrupted dataset through an IPython shell and a performance-evaluation tool to produce and score cleaned datasets. Three error types (Numerical Shift, NaN Corruption, Categorical Shift) are injected into Kaggle datasets. Results show that LLMs can identify and correct some errors using row-level context and feedback from prior iterations, but struggle with complex, distribution-wide errors; providing hints generally improves performance, though the benefit depends on the dataset and model. The work provides a benchmark for LLM-driven data cleaning, highlights both its practical potential and its clear limitations, and outlines future directions: richer feedback, more automation, and broader datasets to improve robustness and scalability.

Abstract

High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real-world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or improper data integration across multiple sources, and these errors can severely degrade model performance. Detecting and correcting these issues typically require tailor-made solutions and demand extensive domain expertise. Consequently, automation is challenging, rendering the process labor-intensive and tedious. In this study, we investigate whether Large Language Models (LLMs) can help alleviate the burden of manual data cleaning. We set up an experiment in which an LLM, paired with Python, is tasked with cleaning the training dataset to improve the performance of a learning algorithm without having the ability to modify the training pipeline or perform any feature engineering. We run this experiment on multiple Kaggle datasets that have been intentionally corrupted with errors. Our results show that LLMs can identify and correct erroneous entries, such as illogical values or outliers, by leveraging contextual information from other features within the same row, as well as feedback from previous iterations. However, they struggle to detect more complex errors that require understanding the data distribution across multiple rows, such as trends and biases.
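
To make the idea of intentionally corrupting a dataset concrete, the sketch below injects the three error types named in the TL;DR into a small table. The summary does not say how the authors implement each corruption, so everything here is an assumption for illustration: the function name `inject_errors`, the `fraction` and `shift` parameters, and the reading of Numerical Shift as a constant offset, NaN Corruption as replacement with NaN, and Categorical Shift as a swap to a different existing category.

```python
import math
import random


def inject_errors(rows, column, kind, fraction=0.2, shift=100.0, seed=0):
    """Return (corrupted_copy, corrupted_indices) for a list of dict rows.

    Hypothetical helper: the corruption rules are illustrative assumptions,
    not the procedure used in the paper.
    """
    rng = random.Random(seed)
    corrupted = [dict(row) for row in rows]  # copy rows, leave the input untouched
    n_bad = int(len(rows) * fraction)
    bad_idx = sorted(rng.sample(range(len(rows)), n_bad))
    categories = sorted({row[column] for row in rows}, key=str)

    for i in bad_idx:
        original = rows[i][column]
        if kind == "numerical_shift":
            # Assumed: add a constant offset to the numeric value.
            corrupted[i][column] = original + shift
        elif kind == "nan_corruption":
            # Assumed: overwrite the value with NaN.
            corrupted[i][column] = math.nan
        elif kind == "categorical_shift":
            # Assumed: swap to a different category already present in the column.
            others = [c for c in categories if c != original]
            if others:
                corrupted[i][column] = rng.choice(others)
        else:
            raise ValueError(f"unknown error type: {kind}")
    return corrupted, bad_idx


# Usage on a toy table.
rows = [{"age": 20 + i, "city": ["A", "B", "C"][i % 3]} for i in range(10)]
dirty, bad_idx = inject_errors(rows, "age", "nan_corruption", fraction=0.3)
```

Returning the corrupted indices alongside the data keeps a ground truth, which a benchmark of this kind would need in order to check which errors a cleaning agent actually repaired.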

Paper Structure

This paper contains 23 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: We provide the model with the path to the dataset along with a prompt instructing it to identify errors so that performance on a held-out set increases by a given threshold. At each iteration $j$, the LLM can send code to IPython to execute and get back the sys.output and/or send the path of the modified dataset $\mathcal{D}_{\text{i}}$ to get a performance score. The loop continues until the cumulative number of tokens used for the entire conversation reaches a pre-defined threshold. All the modified datasets $\mathcal{D}_{0...\text{i}}$ are stored and the dataset with the highest score is considered as $\mathcal{D}_{\text{Best}}$.
  • Figure 2: Performance improvement over $P_{\text{Dirty}}$ for the four models and three datasets.
  • Figure 3: Performance improvement for different cumulative token thresholds, from 25k to 200k.
  • Figure 4: Impact of providing no hint, a weak hint, and a strong hint on the performance improvement for all models and datasets.
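
The loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Action`, `RunResult`, `cleaning_loop`, and the three callables (`propose` standing in for the LLM, `run_code` for the IPython shell, `evaluate` for the performance tool) are hypothetical names, and the token accounting is simplified to one count per action.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Action:
    """One LLM turn: code to run, a dataset path to score, or both."""
    code: Optional[str] = None
    dataset_path: Optional[str] = None
    tokens_used: int = 0


@dataclass
class RunResult:
    scores: Dict[str, float] = field(default_factory=dict)  # every scored dataset
    best_path: Optional[str] = None
    best_score: float = float("-inf")


def cleaning_loop(
    propose: Callable[[List[Tuple[str, object]]], Action],
    run_code: Callable[[str], str],
    evaluate: Callable[[str], float],
    token_budget: int,
) -> RunResult:
    """Iterate until the cumulative token count reaches `token_budget`."""
    result = RunResult()
    history: List[Tuple[str, object]] = []
    total_tokens = 0
    while total_tokens < token_budget:
        action = propose(history)
        if action.tokens_used <= 0:
            raise ValueError("each action must consume tokens, or the loop never ends")
        total_tokens += action.tokens_used
        if action.code is not None:
            # Feed the shell output back into the conversation.
            history.append(("shell_output", run_code(action.code)))
        if action.dataset_path is not None:
            # Score the submitted dataset and remember the best one so far.
            score = evaluate(action.dataset_path)
            result.scores[action.dataset_path] = score
            history.append(("score", score))
            if score > result.best_score:
                result.best_score = score
                result.best_path = action.dataset_path
    return result
```

Keeping every scored dataset in `scores` mirrors the caption's statement that all modified datasets are stored, with the highest-scoring one taken as the best.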