MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Yunxiang Zhang; Muhammad Khalifa; Shitanshu Bhushan; Grant D Murphy; Lajanugen Logeswaran; Jaekyeom Kim; Moontae Lee; Honglak Lee; Lu Wang

TL;DR

MLRC-Bench turns seven open research problems from ML conference competitions into repository-level research challenges in which language agents must both propose novel research ideas and implement them under compute constraints. It combines objective evaluation of Effectiveness, Efficiency, and Simplicity with a Relative Improvement to Human measure, which enables cross-task comparison and fair benchmarking. The results show that even the best agent closes only 9.3% of the gap between the baseline and the top human participant scores, and that LLM-judged novelty tracks empirical performance poorly, exposing a gap between perceived and actual research impact. The benchmark is designed to evolve with the field, promote rigorous evaluation of AI-assisted research, and encourage safer, scalable automation of scientific discovery.

Abstract

We introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work such as AI Scientist, which evaluates the end-to-end agentic pipeline using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, designed to grow with new ML competitions and encourage rigorous, objective evaluations of AI research capabilities. Our leaderboard and code are available at: https://huggingface.co/spaces/launch/MLRC_Bench
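
To make the "gap closed" figure concrete, the following is a minimal sketch of how such a percentage could be computed from three scores. It assumes the gap is measured linearly between the baseline score and the top human score, which matches the wording of the abstract but is not confirmed here as the benchmark's exact formula. The function name `gap_closed_percent` and the example numbers are hypothetical illustrations, not values from the paper.

```python
def gap_closed_percent(agent_score: float,
                       baseline_score: float,
                       human_score: float) -> float:
    """Percentage of the baseline-to-human gap closed by the agent.

    Assumption (not taken from the paper): the gap is linear in the
    task metric. Because both differences carry the same sign, the
    same expression also works for lower-is-better metrics.
    """
    gap = human_score - baseline_score
    if gap == 0:
        # No gap to close; the ratio would be undefined.
        raise ValueError("human_score must differ from baseline_score")
    return 100.0 * (agent_score - baseline_score) / gap


# Hypothetical scores: baseline 50, top human 100, agent 55.
print(gap_closed_percent(55, 50, 100))  # 10.0
```

Under this reading, an agent that only matches the baseline scores 0, one that matches the top human participant scores 100, and an agent that falls below the baseline receives a negative value.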
