DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang

TL;DR

DataFlow addresses fragmentation in LLM data preparation by providing a unified, PyTorch-like dataflow framework built on nearly 200 operators and six cross-domain pipelines. Its DataFlow-Agent adds autonomous NL-to-pipeline synthesis with verification, enabling model-in-the-loop generation and end-to-end automation. Across the six pipelines, which cover tasks such as Text-to-SQL, math reasoning, code, and knowledge extraction, DataFlow delivers consistent performance gains and data-efficiency improvements, often surpassing baselines with an order-of-magnitude smaller data footprint. The open-source DataFlow ecosystem and CLI/agent tooling pave the way for reproducible, scalable data-centric AI workflows.

Abstract

The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1-3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.

Paper Structure

This paper contains 101 sections, 7 figures, and 12 tables.

Figures (7)

  • Figure 1: High-level architecture of DataFlow. The system consists of a core execution engine (storage, operators, templates, and LLM serving), reusable pipelines, user-facing control layers (CLI and agent), and an extensible ecosystem for domain-specialized workflows. DataFlow produces high-quality, task-aligned datasets consumed by downstream LLM applications.
  • Figure 2: The standard execution pattern of an operator’s run() method in DataFlow. Within run(), the operator interacts with the global DataFlowStorage by retrieving inputs through storage.read(), applying its transformation logic, and writing updated fields back via storage.write(). This read-transform-write paradigm captures how data flows from one operator to the next throughout the workflow.
  • Figure 3: Example of how an operator’s run() method interacts with data via key-based bindings. This flexible key-binding mechanism adapts to arbitrary datasets without preprocessing and enables seamless operator composition.
  • Figure 4: Illustration of the DataFlow pipeline API. The example shows how a pipeline declares its storage and serving backends, instantiates operators with task-specific configurations, and executes them via forward() using input/output key bindings. The interface supports compilation and stepwise resumption, enabling flexible and modular workflow construction.
  • Figure 5: Evolution of sample counts across operator stages in DataFlow pipelines. All pipelines start with 1000 input samples. The Text pipeline mainly performs pre-training data filtering, and the Code pipeline focuses on expanding code capabilities based on existing instruction data; therefore, neither of these pipelines involves any generative components.
  • ...and 2 more figures
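
The captions for Figures 2 to 4 describe three ideas: operators that read from shared storage, transform, and write back; key-based bindings that tell an operator which fields to use; and a pipeline whose forward() chains operators. The following is a minimal toy sketch of that pattern in Python. It is not the DataFlow API: ToyStorage, MapOperator, FilterOperator, ToyPipeline, and the field names "question" and "question_clean" are all hypothetical names chosen for illustration, and only the method names read(), write(), run(), and forward() are taken from the captions.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ToyStorage:
    """Hypothetical stand-in for a shared storage object (not the DataFlow API)."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = [dict(r) for r in rows]

    def read(self) -> List[Row]:
        # Return copies so an operator only changes storage through write().
        return [dict(r) for r in self._rows]

    def write(self, rows: List[Row]) -> None:
        self._rows = [dict(r) for r in rows]


class MapOperator:
    """Applies fn to one field and stores the result under another field."""

    def __init__(self, fn: Callable[[object], object]) -> None:
        self.fn = fn

    def run(self, storage: ToyStorage, input_key: str, output_key: str) -> None:
        rows = storage.read()                          # read
        for row in rows:
            row[output_key] = self.fn(row[input_key])  # transform
        storage.write(rows)                            # write


class FilterOperator:
    """Keeps only the rows whose bound field satisfies a predicate."""

    def __init__(self, predicate: Callable[[object], bool]) -> None:
        self.predicate = predicate

    def run(self, storage: ToyStorage, input_key: str) -> None:
        rows = storage.read()
        storage.write([r for r in rows if self.predicate(r[input_key])])


class ToyPipeline:
    """Declares operators once, then binds them to field names in forward()."""

    def __init__(self, storage: ToyStorage) -> None:
        self.storage = storage
        self.normalize = MapOperator(lambda s: str(s).strip().lower())
        self.drop_short = FilterOperator(lambda s: len(str(s)) >= 5)

    def forward(self) -> List[Row]:
        # Key bindings are passed at call time, so the same operators can be
        # reused on a dataset whose columns have different names.
        self.normalize.run(self.storage, input_key="question",
                           output_key="question_clean")
        self.drop_short.run(self.storage, input_key="question_clean")
        return self.storage.read()


raw = [{"question": "  What is 2 + 2?  "}, {"question": " Hi "}]
result = ToyPipeline(ToyStorage(raw)).forward()
print(result)
```

In this sketch the second row is dropped by the filter because its cleaned text is shorter than five characters, which mirrors, in miniature, how sample counts can shrink across operator stages. Because the field names are supplied as arguments to run(), pointing the same operators at a dataset with a "prompt" column instead of "question" would only require changing the bindings in forward().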