Data Cleaning Using Large Language Models

Shuo Zhang, Zezhou Huang, Eugene Wu

TL;DR

This work introduces Cocoon, a novel data cleaning system that uses large language models to combine statistical error detection and correction with semantic understanding. Cocoon decomposes complex cleaning tasks into manageable components, following a workflow that mimics human cleaning processes.

Abstract

Data cleaning is a crucial yet challenging task in data analysis, often requiring significant manual effort. To automate data cleaning, previous systems have relied on statistical rules derived from erroneous data, resulting in low accuracy and recall. This work introduces Cocoon, a novel data cleaning system that leverages large language models to derive rules from semantic understanding and combines them with statistical error detection. However, data cleaning is still too complex a task for current LLMs to handle in one shot. To address this, Cocoon decomposes complex cleaning tasks into manageable components in a workflow that mimics human cleaning processes. Our experiments show that Cocoon outperforms state-of-the-art data cleaning systems on standard benchmarks.
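
The decomposition described above can be illustrated with a minimal sketch for a single column and a single error type (string outliers). This is not Cocoon's implementation: the function names, the `max_share` threshold, and the case-insensitive matching rule are all assumptions made for illustration, and the semantic step is a plain stub where a real system would query an LLM.

```python
from collections import Counter


def detect_rare_values(values, max_share=0.05):
    """Statistical step (assumed rule): flag values whose relative
    frequency in the column is below max_share."""
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, c in counts.items() if c / total < max_share)


def propose_corrections(candidates, known_values):
    """Stand-in for the semantic step. A real system would ask an LLM;
    this stub only matches case-insensitively after trimming whitespace."""
    canonical = {k.lower(): k for k in known_values}
    mapping = {}
    for value in candidates:
        key = value.strip().lower()
        if key in canonical and canonical[key] != value:
            mapping[value] = canonical[key]
    return mapping


def clean_column(values, max_share=0.05):
    """Run detection, then correction, and apply the resulting mapping."""
    candidates = detect_rare_values(values, max_share)
    frequent = [v for v in set(values) if v not in candidates]
    mapping = propose_corrections(candidates, frequent)
    return [mapping.get(v, v) for v in values], mapping


values = ["New York"] * 30 + ["Boston"] * 20 + ["new york ", "boston", "Chicago"]
cleaned, mapping = clean_column(values)
```

Here the rare but legitimate value `"Chicago"` is flagged by the statistical step and then left unchanged, because the stub finds no canonical form for it. Keeping detection and correction as separate steps is what allows a flagged value to be rejected as a non-error.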

Paper Structure

This paper contains 18 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Cocoon decomposes data cleaning in two dimensions: (a) it decomposes cleaning by error type, for each column; (b) for each type of error, it decomposes the cleaning steps into traditional statistical detection combined with semantic error detection and cleaning.
  • Figure 2: Prompt for Semantic Detection of string outliers for one column through samples.
  • Figure 3: Prompt for Semantic Cleaning of string outliers for one column.
  • Figure 4: The UI for each data cleaning step. The right side is the interface where users specify the SQL clauses for column cast. The left side is the query interface to preview the results.
  • Figure 5: Output SQL queries for results. We provide the cleaning reasoning as natural language (NL) in the comments and use SQL for cleaning.
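
The output format described in the Figure 5 caption (natural-language reasoning in comments, SQL for the cleaning itself) might look roughly like the result of the sketch below. The function `render_cleaning_sql`, the `CASE` expression layout, and the table and column names are hypothetical and are not taken from the paper. Identifiers are inserted unquoted, so this assumes simple, trusted table and column names.

```python
def render_cleaning_sql(table, column, mapping, reasoning):
    """Render a SELECT that rewrites `column` according to `mapping`,
    preceded by the reasoning as SQL line comments."""

    def quote(text):
        # Standard SQL string literal: double any embedded single quote.
        return "'" + text.replace("'", "''") + "'"

    lines = [f"-- {line}" for line in reasoning.splitlines()]
    lines.append("SELECT")
    if mapping:
        lines.append("  CASE")
        for old, new in sorted(mapping.items()):
            lines.append(f"    WHEN {column} = {quote(old)} THEN {quote(new)}")
        lines.append(f"    ELSE {column}")
        lines.append(f"  END AS {column}")
    else:
        # Nothing to fix: pass the column through unchanged.
        lines.append(f"  {column}")
    lines.append(f"FROM {table};")
    return "\n".join(lines)


sql = render_cleaning_sql(
    table="customers",
    column="city",
    mapping={"boston": "Boston", "O'hare": "O'Hare"},
    reasoning="Normalize inconsistent capitalization of city names.",
)
print(sql)
```

Emitting SQL rather than mutating the data directly keeps each cleaning step reviewable and re-runnable, which fits the preview-oriented interface described in the Figure 4 caption.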

Theorems & Definitions (1)

  • Example 1