Dataforge: A Data Agent Platform for Autonomous Data Engineering

Xinyuan Wang, Yanjie Fu

TL;DR

The paper addresses the bottleneck of transforming heterogeneous raw data into AI-ready inputs by introducing Dataforge, an autonomous Data Agent designed for tabular data. It leverages LLM reasoning with grounding, hierarchical routing, and dual feedback loops to perform end-to-end data cleaning, feature engineering, and validation without expert intervention, achieving reliable, scalable data preparation. The six-stage Dataforge pipeline, safety-focused grounding, automatic provenance reporting, and a user-friendly interface enable non-experts to produce high-quality data representations with minimal overhead. Demonstrations across nine datasets and a heart-disease detection task illustrate substantial gains in automation, robustness, and efficiency, highlighting practical impact for AI workflows in materials science, biology, and climate science.

Abstract

The growing demand for AI applications in fields such as materials discovery, molecular modeling, and climate science has made data preparation an important but labor-intensive step. Raw data from diverse sources must be cleaned, normalized, and transformed to become AI-ready, while effective feature transformation and selection are essential for efficient training and inference. To address the challenges of scalability and expertise dependence, we present Data Agent, a fully autonomous system specialized for tabular data. Leveraging large language model (LLM) reasoning and grounded validation, Data Agent automatically performs data cleaning, hierarchical routing, and feature-level optimization through dual feedback loops. It embodies three core principles: automatic, safe, and non-expert friendly, which ensure end-to-end reliability without human supervision. This demo showcases the first practical realization of an autonomous Data Agent, illustrating how raw data can be transformed "From Data to Better Data."

Paper Structure

This paper contains 11 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Conceptual comparison between a traditional manual workflow and the agentic workflow.
  • Figure 2: The framework of the Dataforge system.
  • Figure 3: The interface of Dataforge.
  • Figure 4: Dataforge Dealing with Heart-Disease Detection.