DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu

TL;DR

DA-Code introduces a challenging, agent-centered benchmark for data science code generation, targeting data wrangling, EDA, and ML tasks in a fully executable sandbox. It defines a formal agent interaction framework with explicit memory, action, and execution loops, and couples this with a robust annotation pipeline and a multi-metric evaluation suite (table, chart, and ML performance). Empirical results show current state-of-the-art LLMs struggle on these tasks, with GPT-4 achieving around 30.5% overall accuracy, underscoring substantial room for improvement. The work also provides a DA-Agent baseline and extensive analysis (including ablations and trajectory studies) to guide future enhancements in planning, grounding, and environment interaction for data-science oriented agents.
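The memory, action, and execution loop mentioned above can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the DA-Agent implementation: the names `Action`, `Memory`, `execute`, `run_agent`, and `scripted_policy` are hypothetical, the action set is reduced to "python" and "terminate", and a scripted policy stands in for the LLM.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    kind: str     # hypothetical action names: "python" or "terminate"
    payload: str  # code to execute, or the final answer on termination


@dataclass
class Memory:
    # History of (action, observation) pairs the policy can condition on.
    steps: List[Tuple[Action, str]] = field(default_factory=list)


def execute(action: Action, timeout: float = 10.0) -> str:
    """Run a Python action in a subprocess and return its combined output."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", action.payload],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "error: timed out"
    return (result.stdout + result.stderr).strip()


def run_agent(
    policy: Callable[[str, Memory], Action],
    task: str,
    max_steps: int = 5,
) -> Tuple[Optional[str], Memory]:
    """Alternate between choosing an action and executing it."""
    memory = Memory()
    for _ in range(max_steps):
        action = policy(task, memory)
        if action.kind == "terminate":
            return action.payload, memory
        observation = execute(action)
        memory.steps.append((action, observation))
    # Step budget exhausted without a final answer.
    return None, memory


def scripted_policy(task: str, memory: Memory) -> Action:
    """Stand-in for an LLM: run one snippet, then report its output."""
    if not memory.steps:
        return Action("python", "print(sum([1, 2, 3]))")
    return Action("terminate", memory.steps[-1][1])


if __name__ == "__main__":
    answer, memory = run_agent(scripted_policy, "Sum the list [1, 2, 3].")
    print(answer, len(memory.steps))
```

Returning `None` when the step budget runs out keeps unfinished tasks distinguishable from tasks that terminated with a wrong answer, which matters when completion and success are reported separately.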

Abstract

We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements. First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must use complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable, executable, and scalable environment that aligns with real-world data analysis scenarios. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, it achieves only 30.5% accuracy even with the current best LLMs, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.

Paper Structure

This paper contains 110 sections, 22 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: DA-Agent tackling an example in DA-Code.
  • Figure 2: Annotation pipeline of DA-Code. See the start of the annotation pipeline section for a detailed description.
  • Figure 3: Detailed performance analysis of DA-Agent across various categories on DA-Code.
  • Figure 4: The task instruction and the corresponding reference plan of the DA-Code example.
  • Figure 5: The success and incompletion rates of various models over steps. The incompletion rates represent the proportion of tasks not completed and the success rates indicate the proportion of tasks successfully completed within the current step. A task is considered successful if it scores above zero.
  • ...and 3 more figures