GEM: A Gym for Agentic LLMs

Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, Chuen Yang Beh, Weixun Wang, Hao Zhu, Weiyan Shi, Diyi Yang, Michael Shieh, Yee Whye Teh, Wee Sun Lee, Min Lin

TL;DR

GEM introduces a Gym-like framework for agentic LLMs to address the lack of multi-turn, long-horizon RL environments. It provides a standardized interface, asynchronous vectorization, and a diverse task suite with tool integration, enabling both training and evaluation across multiple RL frameworks. A simple yet effective variant of REINFORCE with Return Batch Normalization (ReBN) is shown to work robustly with dense per-turn rewards and arbitrary discount factors, and experiments in GEM offer insights into the effect of the discount factor (gamma), tool usage, and cross-task generalization. The framework serves as both a training playground and a unified evaluation toolkit, and is intended to accelerate research on autonomous, tool-using LLM agents.

Abstract

The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills by interacting with complex environments. To facilitate this transition we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating how to use GEM with five popular RL training frameworks. We also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which -- unlike GRPO -- is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apples-to-apples benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM functions as a convenient evaluation toolkit in addition to serving as a training environment. We hope this framework can help accelerate future agentic LLM research.
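
To make the idea of normalizing returns under dense per-turn rewards concrete, here is a minimal sketch. It assumes that ReBN computes a discounted return for every turn and then standardizes those returns with the mean and standard deviation of the whole batch; the abstract does not spell out the formula, so this form, and the names `discounted_returns` and `rebn_advantages`, are illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean, pstdev


def discounted_returns(rewards, gamma):
    """Return G_t = sum_k gamma**k * r_{t+k} for every turn t of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the return of the following turn.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def rebn_advantages(episodes, gamma=1.0, eps=1e-8):
    """Standardize per-turn returns over the whole batch (assumed ReBN form)."""
    per_episode = [discounted_returns(ep, gamma) for ep in episodes]
    flat = [g for ep in per_episode for g in ep]
    mu = mean(flat)
    sigma = pstdev(flat)  # population std over all turns in the batch
    return [[(g - mu) / (sigma + eps) for g in ep] for ep in per_episode]


# Two episodes of different length, each with one reward per turn.
batch = [[0.0, 0.0, 1.0], [0.0, -1.0]]
advantages = rebn_advantages(batch, gamma=0.9)
```

Each entry of `advantages` would then weight the log-probability of the tokens produced in the corresponding turn. Because the statistics are taken over turns rather than over groups of whole responses, the sketch works for episodes of any length and for any value of `gamma`.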

Paper Structure

This paper contains 28 sections, 4 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Learning curves of Qwen3-based agents across diverse environments in 5 categories: game (language games); rg (ReasoningGym); code (coding tasks); math (Python-integrated math questions); qa (search-integrated general questions). All agents are trained with a simple yet general multi-turn algorithm based on REINFORCE (Algorithm 1). The comparison between the two curves in each subplot illustrates the effectiveness of Return Batch Normalization (ReBN).
  • Figure 2: Illustration of autoreset in vectorized environments. Autoresetting resets the environment automatically after termination, allowing users to collect batches of episodes by simply running .step() without needing more complicated logic such as keeping track of whether individual episodes have terminated.
  • Figure 3: Illustration of different views of agentic RL. Green nodes denote the tokens that contribute to the loss.
  • Figure 4: Algorithm benchmarking using eight representative environments from GEM. All agents are trained from Qwen3-{scale}-Base models, with scale specified in each plot. rg refers to single-turn reasoning tasks from ReasoningGym; game consists of long-horizon language games; qa and math are tool-integrated multi-turn environments.
  • Figure 5: (a) Average number of turns and episode return when trained with different discount factors. (b) Comparative experiment results on tool availability.
  • ...and 7 more figures
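
The autoreset behaviour described in the Figure 2 caption can be sketched with a toy vectorized wrapper. Both `CountdownEnv` and `AutoResetVectorEnv` are hypothetical classes written for this illustration and are not GEM's API; the sketch also assumes the reset happens inside the same `.step()` call that observes termination, which is only one of several possible autoreset conventions.

```python
class CountdownEnv:
    """Hypothetical toy environment: an episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return f"steps left: {self.length}"

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        reward = 1.0 if done else 0.0
        return f"steps left: {self.length - self.t}", reward, done


class AutoResetVectorEnv:
    """Hypothetical wrapper that steps several environments and autoresets them."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                # Replace the terminal observation with the first
                # observation of a fresh episode.
                obs = env.reset()
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


vec = AutoResetVectorEnv([CountdownEnv(2), CountdownEnv(3)])
obs = vec.reset()
finished = 0
for _ in range(6):
    # The caller never checks which episode ended; it just keeps stepping.
    obs, rewards, dones = vec.step(["noop"] * len(obs))
    finished += sum(dones)
```

The collection loop contains no per-environment bookkeeping: the `dones` flags mark episode boundaries in the collected batch, while the wrapper itself takes care of starting new episodes.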