ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges

Roshan Kenia, Xiaoman Zhang, Pranav Rajpurkar

TL;DR

ReX-MLE introduces a 20-task, end-to-end autonomous-agent benchmark for medical imaging derived from Grand Challenge competitions, designed to test full pipelines under strict compute and time constraints. The study evaluates state-of-the-art agents using multiple LLM backends and reveals a substantial gap to human experts, especially on segmentation and 3D tasks, even when the agents are provided with expert solution reports. A capability analysis using an automated LLM adjudicator shows that current agents do not sufficiently demonstrate domain knowledge or sound engineering practices. Time-budget and backend ablations indicate that neither more compute nor a different LLM closes the gap, underscoring the need for domain-specific reasoning and robust scientific workflows. Overall, ReX-MLE provides a realistic, domain-focused framework to drive the development of autonomous systems capable of credible medical-imaging research.

Abstract

Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R&D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.
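The abstract reports agent results as a percentile relative to human experts but does not spell out the formula here. As a minimal sketch only, assuming the percentile is the share of human leaderboard entries that a submission strictly outperforms, the computation could look like the following. The function name `percentile_rank`, its parameters, and the example scores are hypothetical illustrations, not the benchmark's actual grading code.

```python
from typing import Sequence


def percentile_rank(
    agent_score: float,
    leaderboard_scores: Sequence[float],
    higher_is_better: bool = True,
) -> float:
    """Percent of leaderboard entries that the agent strictly beats.

    Assumption: ties do not count as wins, so an agent that is worse
    than (or equal to) every human entry lands at the 0th percentile.
    """
    if not leaderboard_scores:
        raise ValueError("leaderboard_scores must not be empty")
    if higher_is_better:
        # e.g. Dice or AUC: a larger score is a better submission
        beaten = sum(1 for s in leaderboard_scores if s < agent_score)
    else:
        # e.g. a distance or error metric: a smaller score is better
        beaten = sum(1 for s in leaderboard_scores if s > agent_score)
    return 100.0 * beaten / len(leaderboard_scores)


# Made-up scores for illustration: the agent trails every human entry.
human_scores = [0.80, 0.85, 0.90]
print(percentile_rank(0.50, human_scores))
```

Under this assumed definition, a submission that fails to beat any human entry scores 0, which matches how a "0th percentile" result would arise.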

Paper Structure

This paper contains 25 sections, 4 figures, 25 tables.

Figures (4)

  • Figure 1: Performance of SOTA AI coding agents on ReX-MLE.
  • Figure 2: Overview of ReX-MLE Task Categories. This figure illustrates the four distinct task types included in the benchmark: Segmentation, Detection, Classification, and Image Generation.
  • Figure 3: Autonomous Agent Interaction Workflow. This diagram depicts the workflow between the Medical Image Challenge Environment and the AI Agent. The environment provides task instructions, data, and grading feedback, while the agent iteratively performs strategy generation, error analysis, coding, debugging, and model training to produce a final submission.
  • Figure 4: Comparison of ML research agent capabilities across 13 key success factors.