KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Om Chabra, Sivaprasad Sudhir, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska

TL;DR

KramaBench addresses the lack of real-world, end-to-end benchmarks for data-to-insight pipelines over heterogeneous data lakes. It compiles 104 pipelines across 6 domains and, using a reference implementation (DS-Guru) alongside multiple agentic systems, evaluates end-to-end automation, pipeline design, and sub-task implementation. The findings show that out-of-the-box LLMs can handle well-specified coding tasks, but end-to-end pipeline construction under realistic data and domain constraints remains difficult; agentic approaches perform best (up to ~50% overall). The work identifies core bottlenecks in data retrieval, data-dependent reasoning, and integration of prior domain knowledge, and releases open-source artifacts to support the development of autonomous data science agents for real-world applications.

Abstract

Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical expertise, and even project-specific insights. AI systems have shown remarkable reasoning, coding, and understanding capabilities. However, it remains unclear to what extent these capabilities translate into successful design and execution of such complex pipelines. We introduce KramaBench: a benchmark composed of 104 manually curated real-world data science pipelines spanning 1700 data files from 24 data sources in 6 different domains. We show that these pipelines test the end-to-end capabilities of AI systems on data processing, requiring data discovery, wrangling and cleaning, efficient processing, statistical reasoning, and orchestration of data processing steps given a high-level task. Our evaluation tests 5 general models and 3 code generation models using our reference framework, DS-Guru, which instructs the AI model to decompose a question into a sequence of subtasks, reason through each step, and synthesize Python code that implements the proposed design. Our results on KramaBench show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, existing out-of-the-box models fall short when extensive data processing and domain knowledge are required to construct real-world data science pipelines. Progress on KramaBench represents a crucial step towards developing autonomous data science agents for real-world applications. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench.
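
The decompose, reason, and synthesize pattern that the abstract attributes to DS-Guru can be illustrated with a minimal sketch. This is not the DS-Guru implementation: the prompt wording, the JSON reply format, and the names `build_prompt`, `plan_pipeline`, and `stub_model` are all assumptions made for illustration, and the model is replaced by a stub so the example runs without any external service.

```python
import json
from typing import Callable, List


def build_prompt(question: str, file_names: List[str]) -> str:
    """Assemble an instruction prompt (hypothetical wording, not DS-Guru's)."""
    files = "\n".join(f"- {name}" for name in file_names)
    return (
        "You are given a data lake with these files:\n"
        f"{files}\n\n"
        f"Question: {question}\n\n"
        "1. Decompose the question into an ordered list of subtasks.\n"
        "2. Reason through each subtask.\n"
        "3. Write Python code that implements the full pipeline.\n"
        'Reply as JSON with keys "subtasks" (list of strings) '
        'and "code" (string).'
    )


def plan_pipeline(
    question: str,
    file_names: List[str],
    model: Callable[[str], str],
) -> dict:
    """Send the prompt to a model and validate the structured reply."""
    reply = model(build_prompt(question, file_names))
    plan = json.loads(reply)
    # Reject replies that do not follow the assumed JSON format.
    if not isinstance(plan.get("subtasks"), list):
        raise ValueError("reply is missing a 'subtasks' list")
    if not isinstance(plan.get("code"), str):
        raise ValueError("reply is missing a 'code' string")
    return plan


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed, well-formed reply."""
    return json.dumps(
        {
            "subtasks": ["load the CSV", "compute the mean"],
            "code": "print('placeholder')",
        }
    )


plan = plan_pipeline(
    "What is the average reading per site?",  # hypothetical question
    ["readings.csv"],  # hypothetical file name
    stub_model,
)
print(plan["subtasks"])
```

Keeping the model behind a plain callable means the same planning step could be exercised with different models, which mirrors how the evaluation compares several models under one framework.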

Paper Structure

This paper contains 21 sections, 2 equations, 2 figures, and 16 tables.

Figures (2)

  • Figure 1: One of the tasks of KramaBench based on a real data lake of 136 files in the legal discovery domain. Data file sample snippets are simplified.
  • Figure 2: Data snippets for study cases. Multiple water testing entries for each location may exist.