Data Wrangling Task Automation Using Code-Generating Language Models

Ashlesha Akella, Krishnasuri Narayanam

TL;DR

The paper tackles data quality in large tabular datasets by replacing row-by-row LLM inference with a code-generation workflow that translates data patterns into executable wrangling code. It introduces a two-path methodology (memory-dependent vs. memory-independent) with column relevance filtering, retrieval of external knowledge, and iterative prompt refinement, using $D$, $\tilde{D}$, $G$, and $KB_{\tilde{D}}$ to decide which workflow to follow. Multiple code snippets are generated via $k$-fold cross-validation and combined by majority consensus; the approach reports accuracy of 0.92–0.99 on imputation, error detection, and error correction tasks while requiring far fewer LLM calls than per-row inference. This offers scalable, semantically informed data wrangling suitable for industrial datasets and diverse tabular domains.
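The majority-consensus step can be illustrated with a minimal sketch. It assumes that each generated snippet is a Python function mapping a row to a predicted value (or `None` when it has no answer) and that consensus is taken per row by plain plurality voting; the summary does not specify these details. The snippet functions, the `product` column, and the category labels are hypothetical stand-ins for LLM-generated code, not output from the authors' system.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

Row = dict
Snippet = Callable[[Row], Optional[str]]


def majority_vote(row: Row, snippets: Sequence[Snippet]) -> Optional[str]:
    """Return the value predicted by the most snippets, or None if none answer."""
    votes = []
    for snippet in snippets:
        try:
            value = snippet(row)
        except Exception:
            # Assumption: a snippet that crashes on a row simply abstains.
            continue
        if value is not None:
            votes.append(value)
    if not votes:
        return None
    # most_common(1) yields the single most frequent value and its count.
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for snippets generated from different folds.
def snippet_a(row: Row) -> Optional[str]:
    return "Beverages" if "tea" in row["product"].lower() else None


def snippet_b(row: Row) -> Optional[str]:
    name = row["product"].lower()
    return "Beverages" if "tea" in name or "coffee" in name else None


def snippet_c(row: Row) -> Optional[str]:
    # Deliberately cruder rule, to show a dissenting vote being outvoted.
    return "Snacks" if "bags" in row["product"].lower() else None


SNIPPETS = [snippet_a, snippet_b, snippet_c]

if __name__ == "__main__":
    print(majority_vote({"product": "Green Tea Bags"}, SNIPPETS))
```

Because the snippets are ordinary functions, they can be applied to every row without further LLM calls, which is the source of the cost saving over per-row inference described above.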

Abstract

Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, often cannot capture semantic context, and deep learning approaches are resource-intensive, requiring task- and dataset-specific training. To overcome these shortcomings, we present an automated system that utilizes large language models to generate executable code for tasks like missing value imputation, error detection, and error correction. Our system aims to identify inherent patterns in the data while leveraging external knowledge, effectively addressing both memory-dependent and memory-independent tasks.

Paper Structure

This paper contains 5 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Auto-generated code produced by our system for the Missing Value Imputation task: imputing 'Category' in the BigBasket dataset (above) and 'Country Name' in the Airline dataset (below).
  • Figure 2: Workflow of data wrangling task automation through code generation