AIRA_2: Overcoming Bottlenecks in AI Research Agents

Karen Hambardzumyan, Nicolas Baldwin, Edan Toledo, Rishi Hazra, Michael Kuchnik, Bassel Al Omari, Thomas Simon Foster, Anton Protopopov, Jean-Christophe Gagnon-Audet, Ishita Mediratta, Kelvin Niu, Michael Shvartsman, Alisia Lupidi, Alexis Audran-Reiss, Parth Pathak, Tatiana Shavrina, Despoina Magka, Hela Momand, Derek Dunfield, Nicola Cancedda, Pontus Stenetorp, Carole-Jean Wu, Jakob Nicolaus Foerster, Yoram Bachrach, Martin Josifoski

Abstract

Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) validation-based selection creates a generalization gap, causing performance to degrade over extended search horizons; and (3) fixed, single-turn LLM operators impose a capability ceiling on search performance. We introduce AIRA$_2$, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that scales experiment throughput linearly with the number of workers; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA$_2$ achieves a mean Percentile Rank of 71.8% at 24 hours, surpassing the previous best of 69.9%, and steadily improves to 76.0% at 72 hours. Ablation studies reveal that each component is necessary and that the "overfitting" reported in prior work was driven by evaluation noise rather than true data memorization.
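
To make the first architectural choice concrete, the sketch below shows one way an asynchronous multi-GPU worker pool can refill each worker with a new mutation task the moment it finishes, with no synchronization barrier between generations. This is a minimal illustrative sketch only: the `propose_mutation` and `run_experiment` placeholders, the thread-pool scheduling, and the budget constants are assumptions, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

NUM_WORKERS = 8          # one worker per GPU in the main experiments
EXPERIMENT_BUDGET = 32   # hypothetical total number of candidate solutions

def propose_mutation(population):
    """Placeholder: pick a parent from the population and describe a mutation task."""
    parent = max(population, key=lambda c: c["score"]) if population else None
    return {"parent": parent}

def run_experiment(task, gpu_id):
    """Placeholder: run one candidate on a dedicated GPU and return its search score."""
    return {"parent": task["parent"], "score": 0.0, "gpu": gpu_id}

population = []
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    # Seed one task per worker, then refill each worker as soon as it finishes:
    # throughput scales with the number of workers, with no global barrier.
    pending = {pool.submit(run_experiment, propose_mutation(population), gpu)
               for gpu in range(NUM_WORKERS)}
    launched = NUM_WORKERS
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for finished in done:
            result = finished.result()
            population.append(result)
            if launched < EXPERIMENT_BUDGET:
                pending.add(pool.submit(run_experiment,
                                        propose_mutation(population),
                                        result["gpu"]))
                launched += 1
```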

Paper Structure

This paper contains 40 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: AIRA$_2$ performance on MLE-bench-30. We evaluate AIRA$_2$ against top-performing agents from the MLE-bench leaderboard across different compute budgets. Utilizing 8 GPU workers for all configurations, AIRA$_2$ matches the performance of the strongest leaderboard agents at a 24-GPU-hour budget. Performance improves consistently with additional compute, demonstrating the effectiveness of our architectural design.
  • Figure 2: AIRA$_2$ architecture. The Evolutionary Agent orchestrates the search by maintaining a population of candidate solutions and dispatching mutation tasks to the $N$ workers as they become available, without any synchronization barriers. Each worker asynchronously executes a ReAct agent that iteratively reasons, executes code, and observes outputs until a candidate solution is ready (a minimal sketch of this worker loop follows the figure list). Candidate solutions are evaluated in a separate container, and agents observe only the resulting score. Evaluation is partitioned: $\mathcal{D}_{\text{search}}$ guides optimization while $\mathcal{D}_{\text{val}}$ determines final selection. In our main experiments, we use $N$ = 8 workers.
  • Figure 3: Compute Analysis. We analyse the impact of parallel resources on AIRA$_2$, demonstrating that effective use of parallel compute requires both additional resources and an evolutionary mechanism to utilize them.
  • Figure 4: (a) Stabilizing Long-Horizon Search. We compare the standard self-reported evaluation (blue) against our Hidden Consistent Evaluation protocol (green). While self-reporting leads to eventual performance degradation (confirming toledo2025ai), consistent evaluation ensures long-term improvement. Furthermore, the marginal difference between selecting via the $\mathcal{D}_{\text{search}}$ (seen) and $\mathcal{D}_{\text{val}}$ (unseen) splits suggests the degradation in prior work was due to evaluation noise, not true data overfitting (a sketch of this evaluation split also follows the figure list). (b) The performance profile of AIRA$_2$: performance increases steadily in all configurations, with the 8-worker parallel version performing best, achieving the highest Percentile Rank among all evaluated agents at 24 hours while remaining competitive at the 24-GPU-hour mark.
  • Figure 5: Example of typical behaviour observed in AIRA$_2$ during the champs-scalar-coupling task. The left side displays the chronological performance of each solution in the database (solid lines for mutation, dashed for crossover). The right-hand panel presents a concise summary of the agent's thought process for the nodes and trajectory annotated in green, highlighting the "eureka" moment where the agent identifies underfitting and subsequently scales the model to achieve medal-winning performance. Horizontal dashed lines indicate the medal thresholds and the previous best attempt on the task among all agents (thesislabs2025sota).
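
As referenced in the Figure 2 caption, the per-worker behaviour is a ReAct-style loop: the agent alternates between reasoning, executing code, and observing the output until it judges a candidate solution ready. The sketch below shows only the shape of such a loop under stated assumptions; the `llm_step` and `execute_code` helpers and the stopping condition are hypothetical placeholders, not the paper's agent.

```python
import subprocess
import tempfile

def llm_step(history):
    """Placeholder for an LLM call that returns a thought plus either code to run
    or a final 'submit' action once the candidate solution is ready."""
    return {"thought": "try a larger model to fix underfitting", "action": "submit", "code": ""}

def execute_code(code):
    """Placeholder: run a code snippet in a scratch workspace and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=600)
    return proc.stdout + proc.stderr

def react_worker(task, max_steps=20):
    """Iteratively reason, act, and observe until the agent submits a candidate."""
    history = [task]
    for _ in range(max_steps):
        step = llm_step(history)
        history.append(step["thought"])
        if step["action"] == "submit":            # the agent decides the candidate is ready
            return history
        observation = execute_code(step["code"])  # interactive debugging happens here
        history.append(observation)
    return history
```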
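
The evaluation protocol described in the Figure 2 and Figure 4 captions can also be summarized in a few lines: every candidate is scored by the same fixed evaluator on the $\mathcal{D}_{\text{search}}$ split, which is the only number the search ever observes, while the held-out $\mathcal{D}_{\text{val}}$ split is used solely for final selection. The sketch below, with a hypothetical `score_on` placeholder, illustrates that separation; it is not the paper's evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    search_score: float = 0.0   # score on D_search; the only signal shown to the agent
    val_score: float = 0.0      # score on D_val; hidden from the search entirely

def score_on(candidate, split):
    """Placeholder: score the candidate's predictions against a held-out split
    inside a separate evaluation container and return a scalar."""
    return 0.0

def evaluate(candidate):
    # Both splits are scored by the same fixed protocol rather than self-reported
    # by the agent, which keeps the signal consistent over long search horizons.
    candidate.search_score = score_on(candidate, "D_search")
    candidate.val_score = score_on(candidate, "D_val")
    return candidate.search_score   # the agent only ever sees this number

def select_final(population):
    # Final selection uses the unseen D_val split, not the search signal.
    return max(population, key=lambda c: c.val_score)
```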