FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu

TL;DR

FML-bench is a benchmark that evaluates automatic ML research agents on eight fundamental ML tasks drawn from real-world codebases, paired with a unified five-dimensional evaluation framework. It formalizes an iterative research process and a multi-metric objective that balances empirical utility, exploration breadth, academic contribution, cost, and reliability. Experimental results show that agents employing broad exploration strategies tend to outperform those pursuing narrow but deep refinement, which points to breadth as a critical factor in automatic ML research. The work offers practical guidance for agent design and provides a reproducible benchmark with open-source code to support progress toward scalable, real-world autonomous research systems.

Abstract

Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.

Paper Structure

This paper contains 56 sections, 2 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Overview of FML-bench. FML-bench includes 8 fundamental machine learning research tasks, designed to evaluate agents’ capabilities in solving machine learning research problems. Agents are assessed on their ability to solve machine learning problems through iterative research.
  • Figure 2: Comparison of research exploration strategies of different agents. TheAIScientist uses parallel exploration for broad coverage, AIDE employs hierarchical tree-based search balancing exploration and exploitation, while Claude Code follows linear refinement for sequential improvement.
  • Figure 3: Agents' performance improvement curves across 8 tasks.
  • Figure 4: Performance-Diversity Analysis.