Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Irene Testini, José Hernández-Orallo, Lorenzo Pacchiardi

TL;DR

The paper analyzes how LLM-based assistants and agents are evaluated for data-science tasks, revealing a bias toward goal-directed evaluations that treat the AI as a substitute for the human, and a neglect of data-management and exploratory activities. It combines a taxonomy based on the Data Science Trajectories (DST) map with the SAMR framework and surveys a wide range of benchmarks, contrasting assistant-level, agent-level, and end-to-end evaluations. The key contributions are a structured mapping of evaluation tools to data-science activities and autonomy levels, and a set of directions toward more holistic, human-centered, and transformative assessments. The work aims to standardize evaluation practice and to encourage benchmarks that reward higher levels of automation and meaningful human–AI collaboration.

Abstract

Data science aims to extract insights from data to support decision-making processes. Recently, Large Language Models (LLMs) have been increasingly used as assistants for data science, by suggesting ideas, techniques and small code snippets, or for the interpretation of results and reporting. Proper automation of some data-science activities is now promised by the rise of LLM agents, i.e., AI systems powered by an LLM equipped with additional affordances (such as code execution and knowledge bases) that can perform self-directed actions and interact with digital environments. In this paper, we survey the evaluation of LLM assistants and agents for data science. We find (1) a dominant focus on a small subset of goal-oriented activities, largely ignoring data management and exploratory activities; (2) a concentration on pure assistance or fully autonomous agents, without considering intermediate levels of human-AI collaboration; and (3) an emphasis on human substitution, therefore neglecting the possibility of higher levels of automation thanks to task transformation.

Paper Structure

This paper contains 15 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: martinez2019crisp: "The DST map, containing the outer circle of exploratory activities, inner circle of CRISP-DM (or goal-directed) activities, and at the core the data management activities."
  • Figure 2: Lai2023DS1000: "An example problem in DS-1000. The model needs to fill in the code into [insert] in the prompt on the left; the code will then be executed to pass the multi-criteria automatic evaluation, which includes the test cases and the surface-form constraints; a reference solution is provided at the bottom left."
  • Figure 3: Gu2024BLADE: "Overview of BLADE. Given a research question and dataset, LM agents generate a full analysis containing the relevant conceptual variables, a data transform function, and a statistical modeling function (boxes 1-4-5). BLADE automatically evaluates this against the ground truth (box 6)."
  • Figure 4: Li2025IDABenchEL: "Example task trajectory for Walmart sale prediction, showcasing the iterative interaction between the simulated user providing instructions and the agent executing code within the sandbox to achieve the analysis goal."
  • Figure 5: li2025are: The figure shows the "Normal" mode, where the agent is given all the relevant information and tasked with writing code to address the task, and the "Action" mode, where the agent has to take a specific action (in this case, asking for clarification). "Private" refers to tasks requiring the use of bespoke software libraries to which the agent has access.
  • ...and 2 more figures