AIRA_2: Overcoming Bottlenecks in AI Research Agents

Karen Hambardzumyan, Nicolas Baldwin, Edan Toledo, Rishi Hazra, Michael Kuchnik, Bassel Al Omari, Thomas Simon Foster, Anton Protopopov, Jean-Christophe Gagnon-Audet, Ishita Mediratta, Kelvin Niu, Michael Shvartsman, Alisia Lupidi, Alexis Audran-Reiss, Parth Pathak, Tatiana Shavrina, Despoina Magka, Hela Momand, Derek Dunfield, Nicola Cancedda, Pontus Stenetorp, Carole-Jean Wu, Jakob Nicolaus Foerster, Yoram Bachrach, Martin Josifoski

Abstract

Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) validation-based selection creates a generalization gap, causing performance to degrade over extended search horizons; and (3) fixed, single-turn LLM operators impose a capability ceiling on search performance. We introduce AIRA$_2$, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that scales experiment throughput linearly with the number of workers; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA$_2$ achieves a mean Percentile Rank of 71.8% at 24 hours, surpassing the previous best of 69.9%, and steadily improves to 76.0% at 72 hours. Ablation studies reveal that each component is necessary and that the "overfitting" reported in prior work was driven by evaluation noise rather than true data memorization.
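
To make the first architectural choice concrete, the sketch below shows one way an asynchronous multi-GPU worker pool can refill each worker with a new mutation task the moment it finishes, with no synchronization barrier between generations. This is a minimal illustrative sketch only: the `propose_mutation` and `run_experiment` placeholders, the thread-pool scheduling, and the budget constants are assumptions, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

NUM_WORKERS = 8          # one worker per GPU in the main experiments
EXPERIMENT_BUDGET = 32   # hypothetical total number of candidate solutions

def propose_mutation(population):
    """Placeholder: pick a parent from the population and describe a mutation task."""
    parent = max(population, key=lambda c: c["score"]) if population else None
    return {"parent": parent}

def run_experiment(task, gpu_id):
    """Placeholder: run one candidate on a dedicated GPU and return its search score."""
    return {"parent": task["parent"], "score": 0.0, "gpu": gpu_id}

population = []
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    # Seed one task per worker, then refill each worker as soon as it finishes:
    # throughput scales with the number of workers, with no global barrier.
    pending = {pool.submit(run_experiment, propose_mutation(population), gpu)
               for gpu in range(NUM_WORKERS)}
    launched = NUM_WORKERS
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for finished in done:
            result = finished.result()
            population.append(result)
            if launched < EXPERIMENT_BUDGET:
                pending.add(pool.submit(run_experiment,
                                        propose_mutation(population),
                                        result["gpu"]))
                launched += 1
```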

Paper Structure

This paper contains 40 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: AIRA$_2$ performance on MLE-bench-30. We evaluate AIRA$_2$ against top-performing agents from the MLE-bench leaderboard across different compute budgets. Utilizing 8 GPU workers for all configurations, AIRA$_2$ matches the performance of the strongest leaderboard agents at a 24-GPU-hour budget. Performance improves consistently with additional compute, demonstrating the effectiveness of our architectural design.
  • Figure 2: AIRA$_2$ architecture. The Evolutionary Agent orchestrates the search by maintaining a population of candidate solutions and dispatching mutation tasks to the $N$ workers as they become available, without any synchronization barriers. Each worker asynchronously executes a ReAct agent that iteratively reasons, executes code, and observes outputs until a candidate solution is ready (a minimal sketch of this worker loop follows the figure list). Candidate solutions are evaluated in a separate container, and agents observe only the resulting score. Evaluation is partitioned: $\mathcal{D}_{\text{search}}$ guides optimization while $\mathcal{D}_{\text{val}}$ determines final selection. In our main experiments, we use $N$ = 8 workers.
  • Figure 3: Compute Analysis. We analyse the impact of parallel resources on AIRA$_2$, demonstrating that effective use of parallel compute requires both additional resources and an evolutionary mechanism to utilize them.
  • Figure 4: (a) Stabilizing Long-Horizon Search. We compare the standard self-reported evaluation (blue) against our Hidden Consistent Evaluation protocol (green). While self-reporting leads to eventual performance degradation (confirming toledo2025ai), consistent evaluation ensures long-term improvement. Furthermore, the marginal difference between selecting via the $\mathcal{D}_{\text{search}}$ (seen) and $\mathcal{D}_{\text{val}}$ (unseen) splits suggests the degradation in prior work was due to evaluation noise, not true data overfitting (a sketch of this evaluation split also follows the figure list). (b) The performance profile of AIRA$_2$: performance increases steadily in all configurations, with the 8-worker parallel version performing best, achieving the highest Percentile Rank among all evaluated agents at 24 hours while remaining competitive at the 24-GPU-hour mark.
  • Figure 5: Example of typical behaviour observed in AIRA$_2$ during the champs-scalar-coupling task. The left side displays the chronological performance of each solution in the database (solid lines for mutation, dashed for crossover). The right-hand panel presents a concise summary of the agent's thought process for the nodes and trajectory annotated in green, highlighting the "eureka" moment where the agent identifies underfitting and subsequently scales the model to achieve medal-winning performance. Horizontal dashed lines indicate the medal thresholds and the previous best attempt on the task among all agents (thesislabs2025sota).
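
As referenced in the Figure 2 caption, the per-worker behaviour is a ReAct-style loop: the agent alternates between reasoning, executing code, and observing the output until it judges a candidate solution ready. The sketch below shows only the shape of such a loop under stated assumptions; the `llm_step` and `execute_code` helpers and the stopping condition are hypothetical placeholders, not the paper's agent.

```python
import subprocess
import tempfile

def llm_step(history):
    """Placeholder for an LLM call that returns a thought plus either code to run
    or a final 'submit' action once the candidate solution is ready."""
    return {"thought": "try a larger model to fix underfitting", "action": "submit", "code": ""}

def execute_code(code):
    """Placeholder: run a code snippet in a scratch workspace and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=600)
    return proc.stdout + proc.stderr

def react_worker(task, max_steps=20):
    """Iteratively reason, act, and observe until the agent submits a candidate."""
    history = [task]
    for _ in range(max_steps):
        step = llm_step(history)
        history.append(step["thought"])
        if step["action"] == "submit":            # the agent decides the candidate is ready
            return history
        observation = execute_code(step["code"])  # interactive debugging happens here
        history.append(observation)
    return history
```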
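
The evaluation protocol described in the Figure 2 and Figure 4 captions can also be summarized in a few lines: every candidate is scored by the same fixed evaluator on the $\mathcal{D}_{\text{search}}$ split, which is the only number the search ever observes, while the held-out $\mathcal{D}_{\text{val}}$ split is used solely for final selection. The sketch below, with a hypothetical `score_on` placeholder, illustrates that separation; it is not the paper's evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    search_score: float = 0.0   # score on D_search; the only signal shown to the agent
    val_score: float = 0.0      # score on D_val; hidden from the search entirely

def score_on(candidate, split):
    """Placeholder: score the candidate's predictions against a held-out split
    inside a separate evaluation container and return a scalar."""
    return 0.0

def evaluate(candidate):
    # Both splits are scored by the same fixed protocol rather than self-reported
    # by the agent, which keeps the signal consistent over long search horizons.
    candidate.search_score = score_on(candidate, "D_search")
    candidate.val_score = score_on(candidate, "D_val")
    return candidate.search_score   # the agent only ever sees this number

def select_final(population):
    # Final selection uses the unseen D_val split, not the search signal.
    return max(population, key=lambda c: c.val_score)
```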