MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

TL;DR

MLRC-Bench turns seven competitions from ML conferences into repository-level research tasks in which language agents must both propose novel research ideas and implement them under compute constraints. Solutions are evaluated on objective metrics (Effectiveness, Efficiency, and Simplicity), and a Relative Improvement to Human score enables comparison across tasks. The results show that even the best agent achieves only a modest improvement (closing 9.3% of the gap between the baseline and the top human score) and that LLM-judged novelty tracks empirical performance poorly, underscoring the gap between perceived and actual research impact. The benchmark is designed to evolve with the field, promote rigorous evaluation of AI-assisted research, and encourage safer, scalable automation of scientific discovery.

Abstract

We introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work, e.g., AI Scientist, which evaluates the end-to-end agentic pipeline using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, designed to grow with new ML competitions and encourage rigorous, objective evaluations of AI research capabilities. Our leaderboard and code are available at: https://huggingface.co/spaces/launch/MLRC_Bench
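
The "gap closed" figure quoted above can be read as a normalized score. The sketch below is one plausible reading of that description, not the paper's stated equation (which is not reproduced in this excerpt): the agent's gain over the baseline divided by the top human's gain over the baseline, as a percentage. The function name and the example scores are hypothetical.

```python
def relative_improvement_to_human(agent: float, baseline: float, human: float) -> float:
    """Percentage of the baseline-to-top-human gap closed by the agent.

    Assumed form: 100 * (agent - baseline) / (human - baseline).
    0 means no better than the baseline; 100 means matching the top human.
    """
    gap = human - baseline
    if gap == 0:
        raise ValueError("human and baseline scores are identical; the gap is undefined")
    return 100.0 * (agent - baseline) / gap


# Hypothetical scores chosen only to illustrate the arithmetic.
print(round(relative_improvement_to_human(agent=52.79, baseline=50.0, human=80.0), 2))
# Lower-is-better metrics (e.g., an error rate) also work, because the signs cancel.
print(relative_improvement_to_human(agent=7.0, baseline=10.0, human=4.0))
```

Dividing by the human gap is what makes scores comparable across tasks whose raw metrics have different units and ranges.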

Paper Structure

This paper contains 44 sections, 1 equation, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Overview of MLRC-Bench and its evaluation pipeline. MLRC-Bench standardizes ML conference competitions into an agent-agnostic framework featuring repository-level code execution under compute constraints. Its evaluation relies on objective metrics (effectiveness, efficiency, simplicity) while using subjective LLM-judge scores only to analyze their correlation with objective metrics for assessing LLM-judge reliability (Section \ref{sec:subj_eval}).
  • Figure 2: Radar plots of objective and subjective evaluations for agent-generated solutions across seven research tasks. Each dimension is normalized on a 1–5 scale, where higher values indicate better performance. Objective metrics include Effectiveness, Efficiency, and Simplicity (Simp.), which are highlighted in bold. The rest are subjective metrics, assessed by prompting o1 as a judge. Notably, more effective solutions identified by agents tend to be more complex and time-consuming (e.g., in backdoor trigger recovery). Additionally, overlapping scores in subjective dimensions suggest that LLM-based evaluation struggles to distinguish the research capabilities of different models.
  • Figure 3: Correlation heatmap between objective (x-axis) and subjective (y-axis) metrics for agent-generated solutions across all tasks. Code is included when prompting the LLM to evaluate subjective dimensions. No strong correlation is observed, suggesting that LLM-judged subjective metrics may not reliably indicate empirical performance gains.
  • Figure 4: We track the percentage changes in performance, runtime, and lines of code relative to the baseline across iterative refinement of implementations within a trial of the LLM-based MLAB agent on the development set. For performance improvement, higher is better; for increases in runtime and lines of code, lower is better. These figures show the metrics averaged across all tasks. For a per-task breakdown, please refer to Figures \ref{fig:imp_index_per_task1} and \ref{fig:imp_index_per_task2} in Appendix \ref{sec:more-results}. Together, these figures show that agents tend to over-refine their solutions over time, leading to increased complexity and runtime without proportional performance gains.
  • Figure 5: We perform a cost-effectiveness analysis of various setups. On the x-axis, we plot API cost, where lower is better, and on the y-axis, we show relative improvement to human (Section \ref{sec:metrics}), where higher is better.
  • ...and 11 more figures
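
The comparison behind the Figure 3 heatmap can be illustrated with a small rank-correlation sketch. The excerpt does not say which correlation coefficient the authors use, so Spearman's rho is an assumption here, and the scores below are made-up placeholders rather than results from the paper. All function names are hypothetical.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder data: one objective and one LLM-judged score per solution.
effectiveness = [0.10, 0.45, 0.30, 0.80, 0.55]
judged_novelty = [4.0, 3.0, 4.0, 3.5, 2.5]
print(spearman(effectiveness, judged_novelty))
```

One cell of such a heatmap would correspond to one (objective metric, subjective metric) pair; a value near zero would match the caption's observation that no strong correlation appears.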