PyVision-RL: Forging Open Agentic Vision Models via RL

Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei

TL;DR

PyVision-RL is a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. It combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent interaction collapse and encourage multi-turn tool use.

Abstract

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
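
As a rough illustration of the two training ingredients named in the abstract, the sketch below oversamples rollout groups, filters out groups whose rewards carry no learning signal, and ranks the rest by reward standard deviation. The abstract does not specify any of the details, so the function names, the zero-variance filter, the bonus coefficient, and the rule that the tool bonus is paid only on correct answers are all assumptions made for illustration. The mean-centred advantage without standard deviation normalization follows the configuration described in the Figure 5 caption; everything else is a guess at one plausible shape, not the authors' implementation.

```python
from statistics import mean, pstdev


def accumulative_tool_reward(correct: bool, num_tool_calls: int,
                             bonus_per_call: float = 0.1) -> float:
    """Hypothetical reward: accuracy plus a bonus that grows with tool calls.

    Assumption: the bonus is paid only when the answer is correct, so tool
    calls cannot substitute for accuracy. The coefficient is illustrative.
    """
    if not correct:
        return 0.0
    return 1.0 + bonus_per_call * num_tool_calls


def select_groups(groups: list[list[float]], keep: int) -> list[list[float]]:
    """Filter and rank oversampled rollout groups (one reward list per prompt)."""
    # Filter (assumed rule): drop groups where every rollout earned the same
    # reward, since mean-centred advantages would all be zero.
    informative = [g for g in groups if pstdev(g) > 0.0]
    # Rank: prefer groups with the largest reward standard deviation.
    ranked = sorted(informative, key=pstdev, reverse=True)
    return ranked[:keep]


def advantages(rewards: list[float]) -> list[float]:
    """Mean-centred advantages, without dividing by the standard deviation."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]
```

In this sketch, oversampling simply means passing more groups to `select_groups` than the `keep` budget, so that the batch can still be filled after uninformative groups are discarded.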

Paper Structure (44 sections, 2 equations, 23 figures, 3 tables, 1 algorithm)

Figures (23)

  • Figure 1: Agentic scaffolds of PyVision-RL. We design two agentic scaffolds for image and video understanding under a unified framework of dynamic tooling with Python. For PyVision-Image, both the system prompt and image hints are injected into the MLLM context, and the images are also loaded into the Python runtime. For PyVision-Video, only the system prompt is injected into the MLLM context, while the video is loaded exclusively into the runtime environment. Given a query, the model interleaves reasoning with executable code blocks (code_block_0) to process multimodal inputs. Execution results (mm_clue_0), including textual outputs and rendered images, are appended to the context and fed back to the model. This interaction loop repeats until a final answer is produced. By first restricting video inputs to the runtime, PyVision-Video enables on-demand context construction, where the agent selectively samples and plots task-relevant frames during reasoning, substantially improving visual token efficiency (see Figure 2).
  • Figure 2: Comparison between frame sampling and on-demand context construction. (a) Conventional video MLLMs, e.g., the Qwen-VL series, process videos by uniformly sampling frames and directly injecting them into the model context. (b) In PyVision-Video, we adopt on-demand context construction: the video is loaded only into the Python runtime, and the model selectively samples and plots relevant frames via Python code during the reasoning process, which greatly improves token efficiency.
  • Figure 3: Training dynamics of RL for PyVision-Image. Our training algorithm yields stable optimization and steadily improving performance. Entropy loss and gradient norm decrease smoothly over training, indicating stable RL dynamics. Meanwhile, validation performance on V*, accuracy reward, response length, and the mean number of tool calls consistently increase, showing that the model learns sustained, long-horizon tool-using behavior.
  • Figure 4: Efficiency-performance trade-off on VSI-Bench. Thanks to on-demand context construction, PyVision-Video selectively samples task-relevant frames during reasoning, achieving higher accuracy with substantially fewer visual tokens than frame-sampling baselines such as the Qwen2.5-VL series.
  • Figure 5: Ablation of training components. We report the average performance over seven benchmarks (V* avg@32, HRBench-4K, HRBench-8K, MathVision, MathVerse, WeMath, and DynaMath) under different training configurations, each ablating one component of our method. The Ours setting uses a max turn budget of 4, includes the accumulative tool reward, applies standard deviation sorting for rollout groups, and removes the standard deviation normalization term in advantage estimation. All other settings modify exactly one component relative to Ours. Overall, we observe that (1) applying standard deviation sorting or removing standard deviation normalization consistently improves performance, and (2) incorporating the accumulative tool reward or increasing the max turn budget leads to larger performance gains in later training stages. For example, at step 600, a max turn budget of 4 outperforms a budget of 2 by 1.93%.
  • ...and 18 more figures
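
The interaction loop described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' scaffold: `run_code`, `agent_loop`, and `scripted_model` are hypothetical names, the "video" is a plain list of frame indices, the model is a scripted stand-in for an MLLM, and the default turn budget of 4 mirrors the setting mentioned in the Figure 5 caption. The key point it tries to show is on-demand context construction: the video lives only in the runtime, and only the output of the model's own code enters the context.

```python
import contextlib
import io


def run_code(code: str, runtime: dict) -> str:
    """Execute a code block in the persistent runtime and capture its stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, runtime)  # illustration only; a real system needs a sandbox
    return buffer.getvalue()


def agent_loop(model, query: str, runtime: dict, max_turns: int = 4):
    """Alternate model steps and code execution until an answer is produced."""
    context = [("user", query)]
    for turn in range(max_turns):
        step = model(context)
        if "answer" in step:
            return step["answer"], context
        context.append(("assistant", step["code"]))
        # Only the execution result is appended to the context.
        context.append((f"mm_clue_{turn}", run_code(step["code"], runtime)))
    return None, context  # turn budget exhausted without a final answer


def scripted_model(context):
    """Hypothetical stand-in for the MLLM: one tool call, then an answer."""
    clues = [text for role, text in context if role.startswith("mm_clue_")]
    if not clues:
        # Sample every 25th frame instead of loading the whole video.
        return {"code": "print(video[::25])"}
    return {"answer": clues[-1].strip()}


# The video is loaded into the runtime only, never into the context.
runtime = {"video": list(range(100))}
answer, context = agent_loop(scripted_model, "Which frames matter?", runtime)
```

Here the context ends up holding the query, one code block, and one execution result, while the 100-frame "video" never leaves the runtime dictionary.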