AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia, Derek Dunfield, Karen Hambardzumyan, Daniel Izcovich, Martin Josifoski, Ishita Mediratta, Kelvin Niu, Parth Pathak, Michael Shvartsman, Edan Toledo, Anton Protopopov, Roberta Raileanu, Alexander Miller, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach

TL;DR

AIRS-Bench identifies a gap in standardized evaluation for AI Research Agents and presents a benchmark of 20 end-to-end tasks drawn from recent ML literature to assess autonomous scientific workflows. It formalizes agents as LLMs plus scaffolds, explores sequential and parallel harnesses, and uses a fixed task-definition standard to enable fair cross-framework comparisons. The paper introduces an evaluation protocol with robust metrics, including mean valid submission rate, a normalized performance score, and Elo-style skill ratings, and provides empirical results showing large performance variability with occasional SOTA-level breakthroughs. The work demonstrates substantial headroom for improvement, documents practical challenges in reproducibility, and open-sources task definitions and tooling to catalyze progress toward truly autonomous AI research agents with broad domain applicability.

Abstract

LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.

Paper Structure (57 sections, 8 equations, 14 figures, 8 tables)

Figures (14)

  • Figure 1: Example of an AIRS-Bench task. Each task is specified by a {problem, dataset, metric} triplet. The problem defines the core computational challenge to be solved (e.g. textual similarity); the dataset specifies which data to solve the challenge over (e.g. SICK); finally, the metric is used to quantify performance (e.g. Spearman correlation). The agent receives the full task specification and is expected to develop a solution that in most cases generates predictions on the test labels file, which are then evaluated and compared with the state-of-the-art result.
  • Figure 2: We define an agent as a pair consisting of a large language model (LLM) and a scaffold. A scaffold comprises a set of mechanisms, such as operators and search algorithms, that enable the LLM to explore the solution space effectively. Scaffolds are instantiated by a harness, which serves as a system that encapsulates the agent and manages its research process. The environment provides the agent with the problem specifications, as well as any constraints and resources available for its exploration.
  • Figure 3: Distribution of AIRS-Bench tasks by category. We consider 7 distinct task categories in total: Code, Math, Molecules & Proteins ML, Question Answering, Text Classification, Text Extraction & Matching, and Time Series.
  • Figure 4: Overall performance of the 14 evaluated agents on the three metrics introduced in Section 5.2, namely valid submission rate, average normalized score and Elo rating. Results are ordered by increasing average normalized score.
  • Figure 5: Submission rate distribution for the 14 agents tested. Each bar shows the distribution of submission rates across tasks for a given agent. The categories are defined as follows: invalid indicates that the agent did not provide any valid submission for that task (0% valid submissions); low (1--33%) indicates a valid submission for between 1% and 33% of seeds; medium (34--66%) indicates a valid submission for between 34% and 66% of seeds; and high (67--100%) indicates a valid submission for more than 66% of seeds. Agents are sorted by the combined percentage of seeds in the medium and high categories, highlighting those most reliable across the benchmark.
  • ...and 9 more figures
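
The submission-rate categories described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' evaluation code: the function names, the per-task `(valid_seeds, total_seeds)` input format, and the agent and task names are all hypothetical. Two assumptions are made where the caption is silent: rates are rounded to the nearest whole percent before bucketing (so 1 of 3 seeds counts as low and 2 of 3 as high), and agents are ranked by the share of tasks that fall in the medium or high category.

```python
from collections import Counter

BUCKETS = ("invalid", "low", "medium", "high")


def submission_bucket(valid_seeds: int, total_seeds: int) -> str:
    """Map a per-task valid-submission count to a Figure 5 category.

    Assumption: the rate is rounded to the nearest whole percent, because the
    caption only gives integer ranges (0%, 1-33%, 34-66%, 67-100%).
    """
    if total_seeds <= 0:
        raise ValueError("total_seeds must be positive")
    if not 0 <= valid_seeds <= total_seeds:
        raise ValueError("valid_seeds must lie in [0, total_seeds]")
    if valid_seeds == 0:
        return "invalid"
    # Any non-zero rate counts as at least 1%, so it never falls back to "invalid".
    pct = max(1, round(100 * valid_seeds / total_seeds))
    if pct <= 33:
        return "low"
    if pct <= 66:
        return "medium"
    return "high"


def bucket_distribution(per_task: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Fraction of tasks in each category for one agent."""
    counts = Counter(submission_bucket(v, t) for v, t in per_task.values())
    n_tasks = len(per_task)
    return {bucket: counts[bucket] / n_tasks for bucket in BUCKETS}


def reliability(dist: dict[str, float]) -> float:
    """Combined share of the medium and high categories (the assumed sort key)."""
    return dist["medium"] + dist["high"]


# Hypothetical data: agent -> task -> (valid seeds, total seeds).
results = {
    "agent_a": {"task_1": (10, 10), "task_2": (5, 10), "task_3": (0, 10)},
    "agent_b": {"task_1": (2, 10), "task_2": (0, 10), "task_3": (7, 10)},
}

distributions = {agent: bucket_distribution(tasks) for agent, tasks in results.items()}
ranked = sorted(distributions, key=lambda a: reliability(distributions[a]), reverse=True)
```

Here `ranked` lists the most reliable agent first. A different rounding rule or sort key would only require changing `submission_bucket` or `reliability`.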