DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

Fangyu Lei, Jinxiang Meng, Yiming Huang, Junjie Zhao, Yitong Zhang, Jianwen Luo, Xin Zou, Ruiyi Yang, Wenbo Shi, Yan Gao, Shizhu He, Zuo Wang, Qian Liu, Yang Wang, Ke Wang, Jun Zhao, Kang Liu

TL;DR

DAComp introduces a rigorous benchmark for autonomous data agents across the full data intelligence lifecycle, pairing repository-level data engineering with open-ended data analysis. It defines 210 tasks split into DAComp-DE and DAComp-DA, and uses a hierarchical LLM judge alongside deterministic execution metrics to evaluate end-to-end orchestration and analytical reasoning. Experimental results reveal substantial gaps: even leading models struggle with holistic pipeline maintenance and open-ended insight synthesis, underscoring the need for improvements beyond isolated code generation. The benchmark, including a Chinese adaptation, provides a realistic, multilingual testbed to push toward truly autonomous enterprise data agents and is complemented by publicly available data and code.

Abstract

Real-world enterprise data intelligence workflows encompass data engineering, which turns raw sources into analysis-ready tables, and data analysis, which converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems as requirements change. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io.
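
The abstract says open-ended tasks are scored against hierarchical rubrics but does not spell out how sub-scores combine. As a rough illustration only, the sketch below assumes a rubric is a tree whose leaves carry judge-assigned scores in [0, 1] and whose parents take a weighted average of their children. The `RubricNode` class, the `aggregate` function, the dimension names, and the weights are all hypothetical and are not taken from the DAComp code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node in a hypothetical rubric tree (illustrative, not DAComp's schema)."""
    name: str
    weight: float = 1.0                # relative weight among siblings (assumed)
    score: Optional[float] = None      # leaf score in [0, 1], e.g. from a judge
    children: List["RubricNode"] = field(default_factory=list)


def aggregate(node: RubricNode) -> float:
    """Return the node's score: its own score for a leaf,
    otherwise the weighted average of its children's scores."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf '{node.name}' has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of '{node.name}' have non-positive total weight")
    return sum(child.weight * aggregate(child) for child in node.children) / total_weight


# Example tree with made-up dimensions and scores.
rubric = RubricNode(
    name="report",
    children=[
        RubricNode(
            name="completeness",
            weight=2.0,
            children=[
                RubricNode(name="covers_all_questions", score=1.0),
                RubricNode(name="uses_relevant_tables", score=0.5),
            ],
        ),
        RubricNode(name="insight", weight=1.0, score=0.0),
    ],
)

overall = aggregate(rubric)
print(f"overall rubric score: {overall:.2f}")
```

A weighted average keeps every level on the same [0, 1] scale, so a score can be read at any depth of the tree; other aggregation rules (for example, gating on required items) would be equally plausible under the abstract's description.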

Paper Structure

This paper contains 49 sections, 5 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: DAComp aims to evaluate LLMs on full-lifecycle data intelligence workflows, encompassing repository-level data engineering (DE) and open-ended data analysis (DA).
  • Figure 2: Details of hierarchical rubrics.
  • Figure 3: Data cleaning tasks of DE-Impl staging layer.
  • Figure 4: Component-level performance analysis.
  • Figure 5: Error distribution (left), pipeline survival rate (right).
  • ...and 6 more figures