MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi

TL;DR

MLR-Bench is a benchmark for evaluating AI agents on open-ended ML research. It combines 201 tasks drawn from NeurIPS, ICLR, and ICML workshops, an LLM-based judge (MLR-Judge), and a modular agent scaffold (MLR-Agent) that supports both end-to-end and stepwise research. Evaluations of six frontier LLMs and a coding agent show that LLMs generate coherent ideas and well-structured papers, but coding agents often produce fabricated or invalid experimental results, a major reliability problem. MLR-Judge shows high agreement with human expert reviewers, which supports its use for scalable automated assessment of research outputs. The benchmark is open-sourced to help the community diagnose weaknesses in AI research agents and improve automated research workflows.

Abstract

Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results, posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers and supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.

Paper Structure

This paper contains 29 sections, 10 figures, 18 tables.

Figures (10)

  • Figure 1: An overview of the framework of MLR-Bench, consisting of both an end-to-end evaluation (left) and a stepwise evaluation (right), each of which uses LLM judges to automatically assess performance over 201 tasks. For end-to-end evaluation, we use the same model as backbone in idea generation, proposal generation and paper writing. For stepwise evaluation, various models are tested and compared within each step.
  • Figure 2: The number of tasks grouped by our ML primary categories.
  • Figure 2: Evaluated models (o4mini2025; claude2025; deepseekai2025deepseekr1incentivizingreasoningcapability; ministral2024; qwenteam2025qwen3; gemini2025; codex2025; geminicli2025) in different research stages.
  • Figure 3: Scores of two LLM judge models across seven review dimensions on ten tasks.
  • Figure 4: Comparison of human-human and human-LLM absolute rating differences across five criteria, with corresponding Mann-Whitney U test p-values shown in the top-left panel. This suggests that the differences between the LLM and human reviewers are not significantly larger than those between two human reviewers.
  • ...and 5 more figures
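
The comparison described in the Figure 4 caption can be illustrated with a small sketch. The caption only states that a Mann-Whitney U test was applied to human-human and human-LLM absolute rating differences. Everything else here is an assumption made for illustration: the one-sided alternative, the normal approximation with tie and continuity corrections, the function names, and the toy numbers, which are not data from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u_greater(x, y):
    """One-sided Mann-Whitney U test of whether x tends to exceed y.

    Returns (U statistic for x, approximate p-value). The p-value uses the
    normal approximation with tie and continuity corrections, so it is only
    a rough guide for very small samples.
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u_x = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    n = n1 + n2
    # Tie correction: sum of (t^3 - t) over groups of t tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence either way
    z = (u_x - mean_u - 0.5) / math.sqrt(var_u)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of N(0, 1)
    return u_x, p_value


# Hypothetical absolute rating differences for one criterion (not paper data).
human_llm_diffs = [1, 0, 1, 2, 1, 0]
human_human_diffs = [0, 1, 1, 2, 0, 1]
u_stat, p = mann_whitney_u_greater(human_llm_diffs, human_human_diffs)
```

Under this reading, a large p-value means the test finds no evidence that human-LLM gaps exceed human-human gaps, which matches the caption's interpretation. It does not by itself show that the two kinds of gap are equal.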