TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks

Mykola Pinchuk

TL;DR

This paper introduces TML-Bench, a tabular benchmark for data science agents on Kaggle-style tasks. The MiniMax-M2.1 model achieves the best aggregate performance score across all four competitions under the paper's primary aggregation.

Abstract

Autonomous coding agents can quickly produce strong tabular baselines on Kaggle-style tasks, but their practical value depends on end-to-end correctness and reliability under time limits. This paper introduces TML-Bench, a tabular benchmark for data science agents on Kaggle-style tasks. It evaluates 10 OSS LLMs on four Kaggle competitions under three time budgets (240s, 600s, and 1200s), running each model five times per task and budget. A run counts as successful if it produces a valid submission and a private-holdout score computed on hidden labels that the agent cannot access. The paper reports median performance, success rates, and run-to-run variability. The MiniMax-M2.1 model achieves the best aggregate performance score across all four competitions under the paper's primary aggregation. Average performance improves with larger time budgets, although scaling is noisy for some individual models at the current run count. Code and materials are available at https://github.com/MykolaPinchuk/TML-bench/tree/master.
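
The evaluation grid and the per-cell summaries named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the names `grid_size` and `summarize`, the use of `None` to mark a failed run, and the min-max range as the variability measure are all hypothetical choices. The paper's actual normalization and aggregation rules are defined in its own metrics sections.

```python
from statistics import median

# Grid dimensions taken from the abstract.
N_MODELS = 10
N_COMPETITIONS = 4
BUDGETS_S = (240, 600, 1200)
N_RUNS = 5


def grid_size():
    """Total number of agent runs implied by the full grid."""
    return N_MODELS * N_COMPETITIONS * len(BUDGETS_S) * N_RUNS


def summarize(run_scores):
    """Summarize one (model, competition, budget) cell.

    run_scores: one entry per run; a float score for a successful run,
    None for a run that produced no valid submission (assumed encoding).
    """
    ok = [s for s in run_scores if s is not None]
    return {
        "success_rate": len(ok) / len(run_scores),
        "median": median(ok) if ok else None,
        # Min-max range is only one possible variability measure (assumption).
        "spread": (max(ok) - min(ok)) if len(ok) > 1 else None,
    }


if __name__ == "__main__":
    print(grid_size())
    # Illustrative scores only, not results from the paper.
    print(summarize([0.80, None, 0.90, 0.85, 0.70]))
```

Computing the median only over successful runs keeps a single failure from distorting the performance estimate, while the success rate reports reliability separately. Whether the paper treats failures this way is not stated in the abstract.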

Paper Structure (49 sections, 14 figures, 2 tables)

This paper contains 49 sections, 14 figures, and 2 tables.

Introduction
Benchmark and protocol
Suite and evaluation grid
Prompt strategy and aggregation rule
Evaluation logging and reproducibility
Agent harness (Kilo Code)
Contamination controls
Metrics and normalization
Time budgets
Results
Key findings
Aggregate performance leaderboard
Cross-competition consistency
Reliability and stability
Scaling with time budget
...and 34 more sections

Figures (14)

Figure 1: Aggregate performance leaderboard (primary aggregation: best budget per competition).
Figure 2: Per-competition ranks (1=best).
Figure 3: Performance vs stability (each dot is one model; fill color indicates success rate).
Figure 4: Scaling with time budget.
Figure 5: Overall aggregation (all competitions and all budgets).
...and 9 more figures
