DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan

TL;DR

This paper introduces DARE-bench, a benchmark for machine learning modeling and data science instruction following. It consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets.

Abstract

The fast-growing demand for Large Language Models (LLMs) that can tackle complex, multi-step data science tasks creates an urgent need for accurate benchmarking. Existing benchmarks have two major gaps: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially on machine learning modeling tasks. Fine-tuning on DARE-bench training tasks can substantially improve model performance: supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x, and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x. These improvements demonstrate the value of DARE-bench both as an accurate evaluation benchmark and as critical training data.

Paper Structure (43 sections, 8 equations, 3 figures, 13 tables)

Figures (3)

  • Figure 1: DARE-bench defines each task by providing a natural-language question and structured files (metadata and train/test splits). An LLM agent executes code within a sandbox to generate predictions, which are compared against ground truth for automatic and reproducible evaluation.
  • Figure 2: Automated pipeline of DARE-bench. The construction process consists of four stages: (1) Dataset Sourcing, where Kaggle datasets are filtered by tags, license, size, and metadata; (2) Task Design, where schema summaries, targets, features, and feasibility are analyzed with the help of an LLM; (3) Post-Process, which includes splitting, noise injection for instruction-following (IF) tasks, and resampling or entity checks for time-series-CF tasks; and (4) Finalization, which validates solvability in a sandbox for IF tasks and produces standardized benchmark artifacts.
  • Figure 3: Example of an instruction-following task where the agent fails to respect explicit constraints. Despite being asked to fix the random seed, the model omitted the required argument, leading to incorrect predictions and an evaluation failure.
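
The failure mode in Figure 3 can be illustrated with a small, self-contained sketch. It is not the DARE-bench evaluator: the toy model, the data, and the names `fit_predict` and `exact_match` are hypothetical, and it assumes that predictions are scored by exact comparison against reference predictions produced with the instructed seed. The point is only that a seeded pipeline is reproducible and therefore verifiable, while dropping the seed argument makes the output vary from run to run.

```python
import random

# Hypothetical toy data: (feature, label) rows and test features.
TRAIN = [(float(i), i % 3) for i in range(30)]
TEST = [0.5, 7.2, 13.9, 21.4, 28.8]


def fit_predict(train, test, seed=None):
    """Toy 1-nearest-neighbour model fitted on a random half of the training rows.

    Passing seed=None mimics an agent that omitted the required seed argument:
    the subsample (and possibly the predictions) then changes between runs.
    """
    rng = random.Random(seed)
    sample = rng.sample(train, k=len(train) // 2)
    preds = []
    for x in test:
        nearest = min(sample, key=lambda row: abs(row[0] - x))
        preds.append(nearest[1])
    return preds


def exact_match(preds, reference):
    """Fraction of predictions that equal the reference predictions."""
    return sum(p == r for p, r in zip(preds, reference)) / len(reference)


# Reference predictions, produced while following the instruction "use seed 42".
reference = fit_predict(TRAIN, TEST, seed=42)

# An agent that follows the instruction reproduces the reference exactly.
compliant = fit_predict(TRAIN, TEST, seed=42)
print("compliant agent:", exact_match(compliant, reference))

# An agent that omits the seed gets a run-dependent score.
non_compliant = fit_predict(TRAIN, TEST)
print("seed omitted:   ", exact_match(non_compliant, reference))
```

Because the compliant run uses the same seed as the reference, its score is deterministic, which is what makes this kind of check objective without a human or model judge. The score of the unseeded run is not fixed, so no expected value is shown for it.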