MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu

TL;DR

MLGym presents the first Gym-based framework for AI Research Agents and a benchmark suite (MLGym-Bench) of 13 open-ended tasks spanning CV, NLP, RL, and game theory. The approach enables training and evaluating LLM agents across diverse, real-world research workflows, using a novel AUP-based evaluation that accounts for multiple task-specific metrics and artifacts. Key findings show frontier LLMs improve baselines primarily through hyperparameter tuning rather than generating novel hypotheses or new algorithms, underscoring the need for richer evaluation of creative scientific contributions. The work provides open-source tooling and benchmarks to catalyze reproducible progress in AI-driven scientific discovery and agent-based AI research.

Abstract

We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 open-ended AI research tasks from diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmark, including Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, and develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.

Paper Structure

This paper contains 44 sections, 2 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Diagram of MLGym, a unified framework designed to integrate diverse and open-ended AI research tasks into a single platform for developing and evaluating LLM agents on these tasks.
  • Figure 2: Performance profiles comparing Best Attempt@4 and Best Submission@4 across all models and tasks. The x-axis shows the performance ratio threshold $\tau$ and the y-axis shows the fraction of tasks where a model achieves performance within $\tau$ of the best model.
  • Figure 3: Best Attempt AUP@4 vs cost for all models. The x-axis shows the API cost in USD and the y-axis shows the AUP@4 score.
  • Figure 4: Termination Error Distribution by model. The size of the bars corresponds to the number of times each model triggered an exit status.
  • Figure 5: Number of failed and incomplete runs per model. The criteria for marking a run as incomplete or failed are described in the failure analysis section (\ref{sec:failure_analysis}).
  • ...and 7 more figures
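
The performance profiles in Figure 2 and the AUP score in Figure 3 can be illustrated with a small sketch. It assumes a standard performance-profile construction: for each task, a model's ratio is the best score on that task divided by the model's own score, the profile at threshold τ is the fraction of tasks whose ratio is at most τ, and AUP is the area under that curve. The scores, model names, and τ grid below are invented for illustration, every metric is assumed to be higher-is-better, and the paper's exact definition (for example the scale of τ or the handling of lower-is-better metrics) may differ.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]  # scores[model][task], higher is better (assumption)


def performance_ratios(scores: Scores) -> Scores:
    """Ratio of the best score on each task to each model's score (always >= 1)."""
    tasks = list(next(iter(scores.values())).keys())
    best = {t: max(scores[m][t] for m in scores) for t in tasks}
    return {m: {t: best[t] / scores[m][t] for t in tasks} for m in scores}


def profile(ratios: Dict[str, float], tau: float) -> float:
    """Fraction of tasks on which the model is within a factor tau of the best model."""
    values = list(ratios.values())
    return sum(r <= tau for r in values) / len(values)


def aup(ratios: Dict[str, float], taus: List[float]) -> float:
    """Area under the performance profile, using the trapezoidal rule on the tau grid."""
    area = 0.0
    for lo, hi in zip(taus, taus[1:]):
        area += 0.5 * (profile(ratios, lo) + profile(ratios, hi)) * (hi - lo)
    return area


# Hypothetical scores for two models on two tasks (not results from the paper).
scores = {
    "model_a": {"task_1": 80, "task_2": 50},
    "model_b": {"task_1": 40, "task_2": 50},
}
taus = [1.0, 1.5, 2.0, 2.5, 3.0]

ratios = performance_ratios(scores)
aup_scores = {model: aup(ratios[model], taus) for model in scores}
print(aup_scores)
```

In this toy setting model_a matches the best score on both tasks, so its profile equals 1 over the whole τ range, while model_b only reaches 1 once τ covers its factor-of-two gap on task_1 and therefore gets a smaller area.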