AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

An Luo; Jin Du; Xun Xian; Robert Specht; Fangqiao Tian; Ganghua Wang; Xuan Bi; Charles Fleming; Ashish Kundu; Jayanth Srinivasa; Mingyi Hong; Rui Zhang; Tianxi Li; Galin Jones; Jie Ding

Abstract

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have automated much of the data science workflow. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate the performance of both AI agents and human-AI collaboration in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. The AgentDS website is available at https://agentds.org/ and the open-source datasets are available at https://huggingface.co/datasets/lainmn/AgentDS .

Paper Structure (16 sections, 1 equation, 3 figures, 1 table)

Introduction
The AgentDS Benchmark and Competition
Design Philosophy
Benchmark Scope
Data Curation Process
Evaluation Framework
The AgentDS Competition
AI-Only Baselines
Baseline configurations
Performance of AI-only baselines
Empirical Findings from AgentDS
AI Agents Struggle with Domain-Specific Reasoning
Human Expertise Provides Irreplaceable Value
Human-AI Collaboration Outperforms Either Alone
Limitations and Future Work
...and 1 more sections

Figures (3)

Figure 1: Overall quantile scores of the two AI baselines and the competition teams (n=29). The GPT-4o baseline (orange, score: 0.143) ranks 17th, falling below the participant median of 0.156 (dashed line). The Claude Code agentic baseline (purple, score: 0.458) ranks 10th, exceeding the median and placing in the top third of participants. Bars are sorted in descending order of score (Team 1 = best); both AI baselines are inserted at their rank positions. Quantile scores are the average of per-challenge normalized rankings, with 1.0 indicating best performance and 0.0 indicating non-participation. The results show that current AI-only baselines, whether using direct prompting or agentic coding, do not match the performance of the top human teams in the competition, highlighting a substantial gap between AI automation and human data science expertise.
Figure 2: Distribution of domain-level quantile scores across all participants (teal dots), with the GPT-4o baseline indicated by orange diamonds and the Claude Code baseline by purple squares. GPT-4o falls at or below the domain median in all six domains, with particularly weak performance in Commerce (0.021) and Retail Banking (0.000). Claude Code substantially outperforms GPT-4o in every domain, most notably Manufacturing (0.573), Food Production (0.532), and Retail Banking (0.553), but remains well below the top-performing human teams in each domain, confirming that general-purpose AI systems, even agentic ones, cannot yet replicate the domain-specific strategies of expert human data scientists.
Figure 3: Challenge-specific quantile score distributions across six domains. Teal dots represent participants who submitted for each challenge (zero-score non-submitters excluded from display); orange diamonds show the GPT-4o baseline; purple squares show the Claude Code baseline; gray dashed lines indicate per-challenge participant medians among submitters. Claude Code outperforms GPT-4o across the majority of challenges, with the largest gains in Manufacturing Ch. 1 (Claude: 0.655, GPT-4o: 0.000), Retail Banking Ch. 1 (Claude: 0.741, GPT-4o: 0.000), and Commerce Ch. 3 (Claude: 0.534, GPT-4o: 0.000). Neither system achieves top-quartile performance on every challenge, confirming that current AI approaches cannot match the best human solutions, which leverage domain knowledge, multimodal signals, and iterative expert refinement.
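
The Figure 1 caption describes the overall quantile score as the average of per-challenge normalized rankings, with 1.0 for the best performance and 0.0 for non-participation. The sketch below illustrates one way such a score could be computed. The normalization formula, the function names, and the example ranks are assumptions made for illustration; this section does not state the exact formula used by AgentDS.

```python
from statistics import mean


def challenge_quantile(rank, n_submitters):
    """Map a 1-based rank among submitters to a score in (0, 1].

    Assumed normalization (not taken from the report): the best rank maps
    to 1.0 and the worst submitter still scores above 0.0, so that 0.0 is
    reserved for non-participation.
    """
    if rank is None:  # team did not submit for this challenge
        return 0.0
    if not 1 <= rank <= n_submitters:
        raise ValueError("rank must be between 1 and n_submitters")
    return 1.0 - (rank - 1) / n_submitters


def overall_quantile(ranks, submitters_per_challenge):
    """Average the per-challenge scores, counting skipped challenges as 0.0."""
    return mean(
        challenge_quantile(r, n) for r, n in zip(ranks, submitters_per_challenge)
    )


# Hypothetical team: 1st of 10, 5th of 10, and no submission on a third challenge.
example_score = overall_quantile([1, 5, None], [10, 10, 10])
```

Under this assumed scheme, a team that skips challenges is pulled toward 0.0 even if it ranks well where it does submit, which is consistent with the caption's note that 0.0 indicates non-participation.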
