FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

Qiran Zou; Hou Hei Lam; Wenhao Zhao; Yiming Tang; Tingting Chen; Samson Yu; Tianyi Zhang; Chang Liu; Xiangyang Ji; Dianbo Liu

TL;DR

FML-bench is a benchmark for automatic ML research agents, which are evaluated on eight fundamental ML tasks drawn from real-world codebases under a unified five-dimensional evaluation framework. It formalizes an iterative research process and a multi-metric objective that balances empirical utility, exploration breadth, academic contribution, cost, and reliability. Experiments show that agents employing broad exploration strategies tend to outperform agents that pursue narrow but deep refinement, which points to breadth as a critical factor in automatic ML research. The work offers practical guidance for agent design and provides a reproducible, open-source benchmark to support progress on scalable, real-world autonomous research systems.

Abstract

Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.
