MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Zifeng Wang, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Xin Liu, Carl Yang, Yang Xie, Wenqi Shi

TL;DR

MedAgentGym addresses coding-based biomedical reasoning by providing a scalable, executable training environment that pairs diverse, real-world biomedical tasks with interactive execution feedback. The framework unifies problem formulation, data curation, and an executable sandbox to enable large-scale trajectory sampling and agentic RL fine-tuning, including a verifier-based reward model. Benchmarking 29 LLMs reveals substantial gaps between commercial and open-source models, and Med-Copilot gains about +43% from offline RL and +45% from online RL, approaching gpt-4o on several tasks. The work identifies trajectory sampling, agentic learning, and robust debugging as key to scaling biomedical coding agents, and it provides a public benchmark and resources to support reproducible development in biomedical data science.

Abstract

We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively. These gains demonstrate MedAgentGym as an effective training ground and establish Med-Copilot as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.
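
The multi-turn trajectory sampling with execution feedback described in the abstract can be illustrated with a minimal sketch. This is not the MedAgentGym implementation: the names `Trajectory`, `run_code`, `rollout`, and `scripted_policy` are hypothetical, the "sandbox" is a plain `exec()` call standing in for a real isolated environment, and the policy is a scripted stand-in for an LLM agent. Multi-threaded sampling is omitted for brevity.

```python
import contextlib
import io
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Record of one episode: (code, feedback) pairs plus the final outcome."""
    steps: List[Tuple[str, str]] = field(default_factory=list)
    success: bool = False


def run_code(code: str) -> str:
    """Execute code in a fresh namespace; return its stdout or an error message.

    Assumption: exec() is only a placeholder here and is NOT a secure sandbox.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(
    policy: Callable[[str, str], str],
    task: str,
    ground_truth: str,
    max_turns: int = 3,
) -> Trajectory:
    """Sample one multi-turn trajectory, feeding execution results back to the policy."""
    trajectory = Trajectory()
    feedback = ""
    for _ in range(max_turns):
        code = policy(task, feedback)      # agent proposes code given prior feedback
        feedback = run_code(code)          # environment executes it
        trajectory.steps.append((code, feedback))
        if feedback == ground_truth:       # verifiable ground-truth check
            trajectory.success = True
            break
    return trajectory


def scripted_policy(task: str, feedback: str) -> str:
    """Hypothetical agent: the first attempt is buggy, the second one is repaired."""
    if not feedback:
        return "print(sum(ages) / len(ages))"  # fails: 'ages' is undefined
    return "ages = [34, 51, 47]\nprint(sum(ages) / len(ages))"


if __name__ == "__main__":
    result = rollout(scripted_policy, "Compute the mean patient age.", "44.0")
    for turn, (code, feedback) in enumerate(result.steps, start=1):
        print(f"turn {turn}: {feedback}")
    print("success:", result.success)
```

Each recorded `(code, feedback)` pair is the kind of step a trajectory would carry into offline or online RL; how the actual environment scores and stores trajectories is not specified in this section.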

Paper Structure

This paper contains 53 sections, 17 figures, 12 tables.

Figures (17)

  • Figure 1: Overview of (a) task-specific and (b) overall leaderboard evaluation in MedAgentGym. The results show the (a) performance variations across biomedical data science tasks and (b) large gaps between proprietary and open-source (OSS) LLMs, highlighting the need for continued development of privacy-preserving, affordable LLM agents, especially for complex code-based biomedical reasoning tasks such as biomedical software engineering and predictive modeling.
  • Figure 2: Overview of MedAgentGym. MedAgentGym contains a comprehensive suite of coding-centric biomedical data science tasks with an interactive execution environment for LLM agents.
  • Figure 3: Comparison of (a) offline and (b) online RL paradigms within MedAgentGym.
  • Figure 4: Scalable improvements of LLM agents in MedAgentGym. For inference-time scaling, we employ $T=0$ for the initial rollout and $T=0.6$ for the rest. For train-time scaling, we set $T=0$.
  • Figure 5: Self-Improvement
  • ...and 12 more figures