MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Ran Xu; Yuchen Zhuang; Yishan Zhong; Yue Yu; Zifeng Wang; Xiangru Tang; Hang Wu; May D. Wang; Peifeng Ruan; Donghan Yang; Tao Wang; Guanghua Xiao; Xin Liu; Carl Yang; Yang Xie; Wenqi Shi

TL;DR

MedAgentGym targets coding-based biomedical reasoning with a scalable, executable training environment that pairs diverse real-world biomedical tasks with interactive, multi-turn execution feedback. The framework unifies problem formulation, data curation, and an executable sandbox to support large-scale trajectory sampling and agentic RL fine-tuning, including a verifier-based reward model. Benchmarking 29 LLMs reveals substantial gaps between commercial and open-source models. Trained in this environment, Med-Copilot gains about 43% from offline RL and about 45% from online RL, approaching gpt-4o on several tasks. The work identifies trajectory sampling, agentic learning, and robust debugging as key to scaling biomedical coding agents, and it releases a public benchmark and resources to support reproducible development in biomedical data science.

Abstract

We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively. These gains demonstrate MedAgentGym as an effective training ground and establish Med-Copilot as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.
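
To make the phrase "multi-threaded and multi-turn trajectory sampling" concrete, the following is a minimal Python sketch of the general idea, not the authors' implementation. Every name in it (Task, Trajectory, run_code, toy_policy, sample_trajectory, sample_many) is hypothetical, the policy is a hard-coded stand-in for an LLM, and run_code uses a plain exec call rather than a real isolated sandbox. Each trajectory alternates between proposing code and reading back execution feedback until the answer matches the ground truth or the turn budget runs out; a thread pool samples several trajectories at once.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Task:
    prompt: str
    ground_truth: object  # verifiable reference answer


@dataclass
class Trajectory:
    task: Task
    turns: list = field(default_factory=list)  # (code, feedback) pairs
    success: bool = False


def run_code(code: str):
    """Run code and return (answer, feedback). Illustration only: exec is NOT a sandbox."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"
    if "answer" not in namespace:
        return None, "no variable named 'answer' was defined"
    return namespace["answer"], "ok"


def toy_policy(task: Task, turns: list) -> str:
    """Hypothetical stand-in for an LLM: the first attempt is buggy, later ones are fixed."""
    if not turns:
        return "answer = sum(values)"  # fails: 'values' is undefined
    return "values = [1, 2, 3]\nanswer = sum(values)"


def sample_trajectory(task: Task, policy, max_turns: int = 3) -> Trajectory:
    """Multi-turn loop: propose code, execute it, feed the result back to the policy."""
    traj = Trajectory(task)
    for _ in range(max_turns):
        code = policy(task, traj.turns)
        answer, feedback = run_code(code)
        traj.turns.append((code, feedback))
        if feedback == "ok" and answer == task.ground_truth:
            traj.success = True
            break
    return traj


def sample_many(tasks, policy, workers: int = 4):
    """Sample one trajectory per task in parallel threads, preserving task order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: sample_trajectory(t, policy), tasks))
```

In this sketch, successful trajectories could serve as offline training data and the success flag as a simple verifiable reward signal; how MedAgentGym actually filters trajectories and computes rewards is not specified in the text above.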
