AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, Yoram Bachrach

TL;DR

The paper reframes AI research agents as graph-based search processes and demonstrates that operator design can bottleneck performance, sometimes more than the search strategy does. By engineering an enhanced operator set (AIRA) and evaluating it across Greedy, MCTS, and Evolutionary search within the AIRA-dojo framework, the authors achieve state-of-the-art medal rates on MLE-bench lite. They also expose a substantial generalization gap between validation and test scores and propose robust final-node selection and multi-submission strategies to mitigate it. The work highlights the need to co-design search policies, operators, and evaluation protocols to advance automated ML discovery, and provides a scalable, reproducible platform for future research.

Abstract

AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents' performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.
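The formalization in the abstract (a search policy that picks candidate solutions and modifies them with operators) can be illustrated with a minimal greedy sketch. Everything here is a toy assumption rather than the AIRA-dojo API: `Node`, `greedy_search`, `toy_improve`, and `toy_fitness` are hypothetical names, a candidate solution is a single number instead of code, and the fitness function is an arbitrary quadratic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """One candidate solution in the search graph (hypothetical structure)."""
    solution: float               # toy stand-in; a real agent would store code/artifacts
    score: float                  # fitness of this solution
    parent: Optional["Node"] = None


def greedy_search(
    root_solution: float,
    operator: Callable[[float], float],
    fitness: Callable[[float], float],
    steps: int,
) -> Tuple[Node, List[Node]]:
    """Repeatedly apply `operator` to the highest-scoring node found so far."""
    root = Node(root_solution, fitness(root_solution))
    graph = [root]
    for _ in range(steps):
        # Selection policy: greedy, i.e. pick the best-scoring node.
        best = max(graph, key=lambda n: n.score)
        # Operator: produce a modified solution from the selected node.
        child_solution = operator(best.solution)
        # Fitness function: score the new solution and add it to the graph.
        graph.append(Node(child_solution, fitness(child_solution), parent=best))
    return max(graph, key=lambda n: n.score), graph


def toy_improve(x: float) -> float:
    """Hypothetical 'improve' operator: nudge the solution by a fixed step."""
    return x + 1.0


def toy_fitness(x: float) -> float:
    """Hypothetical fitness with its maximum (0) at x = 3."""
    return -((x - 3.0) ** 2)


if __name__ == "__main__":
    best, graph = greedy_search(0.0, toy_improve, toy_fitness, steps=5)
    print(best.solution, best.score, len(graph))
```

In this toy setting the greedy policy always expands the best node found so far; once the operator stops producing better children, it keeps revisiting the same parent.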

Paper Structure

This paper contains 39 sections, 2 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: AIRA agents use the AIRA-dojo environment, AIRA operators, and search policies to achieve state-of-the-art performance on MLE-bench lite. Operating in AIRA-dojo improves the performance of AIDE, the previous state of the art from MLE-bench. Enhanced operators lead to an improvement from $\text{AIDE}_{\text{greedy}}$ to $\text{AIRA}_{\text{greedy}}$. Exploring the space of search policies can yield further performance gains, but only once these constraints are addressed.
  • Figure 2: Overview of AIRA. Given a problem specification, AIRA maintains a search graph whose nodes are (partial) solutions. At each iteration, the agent (1) selects nodes via a selection policy, (2) picks an operator via an operator policy and applies this operator to the node, and (3) scores the resulting solution via a fitness function. Here, a greedy node‐selection strategy applies the improve operator to the highest‐scoring node.
  • Figure 3: Searching with AIDE's operators. When limited to AIDE's operator set $\mathcal{O}_{\text{AIDE}}$, agents using more advanced search policies (e.g., MCTS, evolutionary algorithms) gain no advantage, underscoring the operator set as the bottleneck.
  • Figure 4: a) AIDE's performance profile over a 24-hour search window: perceived vs. actual medal rate of $\text{AIDE}_{\text{greedy}}$ over 24 hours. The curves show the mean validation (agent-reported) and held-out test medal rates across 20 seeds for all tasks. The widening band illustrates the generalization gap, revealing how apparent gains on the validation set can mask overfitting and ultimately undermine the search process. b) Performance profiles of all agents after a 24-hour search window.
  • Figure 5: Medal rates on MLE-bench lite. Performance is shown for three medal categories: any medal, silver medals and above, and gold medals only. Error bars represent 95% confidence intervals computed using stratified bootstrapping.
  • ...and 12 more figures
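
The Figure 5 caption mentions 95% confidence intervals from stratified bootstrapping. A minimal sketch of that idea follows, under assumptions the caption does not confirm: strata are taken to be tasks, runs (seeds) are resampled with replacement within each task, the medal rate is the mean of per-task rates, and the interval uses the percentile method. The function name `stratified_bootstrap_ci` and the example data are hypothetical, not results from the paper.

```python
import random
from typing import Dict, List, Tuple


def stratified_bootstrap_ci(
    outcomes: Dict[str, List[int]],   # task -> per-seed medal indicators (1 = medal)
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for the medal rate."""
    rng = random.Random(seed)

    def medal_rate(sample: Dict[str, List[int]]) -> float:
        # Average the per-task medal rates so every task counts equally.
        return sum(sum(runs) / len(runs) for runs in sample.values()) / len(sample)

    point = medal_rate(outcomes)
    stats = []
    for _ in range(n_resamples):
        # Stratified step: resample seeds within each task, never across tasks.
        resample = {
            task: rng.choices(runs, k=len(runs)) for task, runs in outcomes.items()
        }
        stats.append(medal_rate(resample))
    stats.sort()

    # Percentile interval: cut alpha/2 of the resamples from each tail.
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return point, stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    # Made-up illustrative data: 3 tasks, 4 seeds each.
    example = {
        "task_a": [1, 1, 0, 1],
        "task_b": [0, 0, 1, 0],
        "task_c": [1, 1, 1, 1],
    }
    point, lo, hi = stratified_bootstrap_ci(example)
    print(f"medal rate = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling within each task keeps the number of runs per task fixed in every resample, so the interval reflects seed-to-seed variation rather than changes in task mix.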