Gym-V: A Unified Vision Environment System for Agentic Vision Research

Fanqing Meng Lingxiao Du Jiawei Gu Jiaqi Liao Linjie Li Zijian Wu Xiangyan Liu Ziqi Zhao Mengkang Hu Yue Zhang Zichen Liu Jiaheng Zhang Michael Qizhe Shieh

Abstract

As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.

Paper Structure (29 sections, 18 figures, 4 tables)

This paper contains 29 sections, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Overview of Gym-V. Top: 105 single-turn and 74 multi-turn environments across 10 categories. Bottom: a unified reset/step interface shared by interactive environments, offline datasets, and evaluation benchmarks.
  • Figure 2: Evaluation fidelity of Gym-V against official pipelines. (a) VLM evaluation on Qwen2.5-VL-7B-Instruct reproduces official VLMEvalKit scores. (b) Text-to-image and image-editing evaluation closely matches official GenExam, RISE, and GenEval results.
  • Figure 3: Left: Single-turn evaluation ($\times 100$) across 7 domain categories. Right: Multi-turn evaluation ($\times 100$) across Games (12 envs), Spatial/2D (6 Minigrid envs), and Spatial/3D (3 MiniWorld envs). Avg: mean over all 10 columns reported in this table. Best result per column in bold. Note that Minigrid (Sp./2D) uses shaped episodic returns that include negative penalties (e.g., stepping into hazards), so mean@3 returns can be below zero.
  • Figure 4: Training reward curves for GRPO, GSPO, and SAPO across 12 single-turn (rows 1–3) and 4 multi-turn (row 4) environments. Smoothed curves (bold) are overlaid on raw trajectories (translucent). All environments exhibit learnable reward signals; no single algorithm dominates uniformly. Multi-turn games show slower convergence and lower absolute returns, reflecting the compounding difficulty of sequential decision-making.
  • Figure 5: Training reward curves for context modeling (top row) and rules injection (bottom row) ablations on four multi-turn games. Top: with context (3-turn, red; 5-turn, blue) vs. without context (green). Bottom: with rules (green) vs. without rules (red). Smoothed curves (bold) are overlaid on raw trajectories (translucent).
  • ...and 13 more figures
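
The captions for Figures 1 and 3 mention a unified reset/step interface and mean@3 episodic returns that can fall below zero. The following is a minimal sketch of both ideas, not Gym-V's actual API: the class `ToyCorridorEnv`, the helpers `rollout` and `mean_at_k`, the reward values, and the Gymnasium-style five-value return from `step` are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Obs = Dict[str, Any]


class ToyCorridorEnv:
    """Hypothetical 1-D corridor with a hazard cell; NOT a Gym-V environment."""

    def __init__(self, length: int = 5, hazard: int = 1, max_steps: int = 10):
        self.length = length        # goal is the last cell, index length - 1
        self.hazard = hazard        # landing on this cell costs a penalty
        self.max_steps = max_steps  # episode is truncated after this many steps
        self.pos = 0
        self.steps = 0

    def _obs(self) -> Obs:
        # A real vision environment would return an image; a text rendering
        # stands in for it here.
        cells = ["A" if i == self.pos else "." for i in range(self.length)]
        return {"render": "".join(cells), "position": self.pos}

    def reset(self, seed: Optional[int] = None) -> Tuple[Obs, Dict[str, Any]]:
        # seed is accepted for interface parity; this toy env is deterministic.
        self.pos = 0
        self.steps = 0
        return self._obs(), {}

    def step(self, action: str) -> Tuple[Obs, float, bool, bool, Dict[str, Any]]:
        self.steps += 1
        if action == "right":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        reward = 0.0
        if self.pos == self.hazard:
            reward -= 0.5           # shaped penalty (assumed value)
        terminated = self.pos == self.length - 1
        if terminated:
            reward += 1.0           # goal bonus (assumed value)
        truncated = (not terminated) and self.steps >= self.max_steps
        return self._obs(), reward, terminated, truncated, {}


def rollout(env: ToyCorridorEnv, policy: Callable[[Obs], str], seed: int) -> float:
    """Run one episode and return the sum of rewards (the episodic return)."""
    obs, _ = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


def mean_at_k(env: ToyCorridorEnv, policy: Callable[[Obs], str], k: int = 3) -> float:
    """Average the episodic return over k independent episodes."""
    return sum(rollout(env, policy, seed=s) for s in range(k)) / k


def always_right(obs: Obs) -> str:
    return "right"


def oscillate(obs: Obs) -> str:
    # Steps onto the hazard and back until the episode is truncated.
    return "right" if obs["position"] == 0 else "left"
```

In this sketch, `mean_at_k(ToyCorridorEnv(), oscillate)` is negative because the policy repeatedly collects the hazard penalty and never reaches the goal, which mirrors how a shaped return with penalties can average below zero.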