
Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR

Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Mingzhu Chen, Jiancan Wu, Kuien Liu, Xiang Wang

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.
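As a worked illustration of how such an objective might look (our assumption; the paper's exact formulas are not reproduced on this page), consider a GRPO-style surrogate in which tokens from expert trajectories carry a clipped importance weight while on-policy tokens keep weight one:

$$
\mathcal{L}(\theta) \;=\; -\,\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y^{(i)}|}\sum_{t=1}^{|y^{(i)}|} w_t^{(i)}\,\hat{A}^{(i)}\,\log \pi_\theta\!\big(y_t^{(i)} \mid x,\, y_{<t}^{(i)}\big),
\qquad
w_t^{(i)} \;=\;
\begin{cases}
\min\!\Big(\tfrac{\pi_\theta(y_t^{(i)} \mid x,\, y_{<t}^{(i)})}{\pi_{\mathrm{exp}}(y_t^{(i)} \mid x,\, y_{<t}^{(i)})},\, c\Big), & y^{(i)} \text{ from the expert},\\[4pt]
1, & y^{(i)} \text{ on-policy},
\end{cases}
$$

where $x$ is the image-question input, $G$ is the rollout-group size, $\hat{A}^{(i)}$ is the group-normalized advantage of trajectory $i$, and $c$ is a clipping constant for stability. Under this reading, trajectory filtering simply drops expert trajectories whose final answers fail the verifiable-reward check before the loss is computed.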

Paper Structure

This paper contains 23 sections, 25 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Comparison of different supervision strategies in multimodal RLVR. (A) Outcome-only supervision may lead to perception failures, where incorrect visual evidence is used for reasoning. (B) Adding perception supervision improves visual grounding but does not guarantee correct reasoning transitions. (C) Our trajectory supervision aligns the perception-reasoning process, enabling reasoning grounded in correct visual evidence and leading to correct answers.
  • Figure 2: Overview of Trajectory-Guided Reinforcement Learning (TGRL). Given an image-question pair, we construct a rollout group by integrating on-policy and expert trajectories. Expert trajectories are then reweighted at the token level and filtered by correctness, enabling trajectory-level alignment within the RLVR framework.
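To make the pipeline in Figure 2 concrete, below is a minimal PyTorch sketch of how the mixed rollout group and token-level reweighting might be wired together. This is an illustration under stated assumptions, not the paper's implementation: the function names, the dict-based trajectory format, the `verifier` callable, and the clipping constant `clip=5.0` are all hypothetical.

```python
import torch


def build_rollout_group(policy_rollouts, expert_trajectories, verifier):
    """Construct a mixed rollout group from on-policy and expert trajectories.

    Hypothetical sketch: trajectories are assumed to be dicts with an
    "answer" field, and `verifier` is any callable returning True when a
    trajectory's final answer passes the verifiable-reward check.
    """
    # Trajectory filtering: keep only expert trajectories whose final
    # answer is verifiably correct.
    experts = [t for t in expert_trajectories if verifier(t["answer"])]
    return policy_rollouts + experts


def token_reweighted_loss(logprobs_policy, logprobs_expert, advantages,
                          is_expert, clip=5.0):
    """Token-level reweighted policy-gradient loss (illustrative only).

    Args (all tensors of shape [batch, seq_len]):
        logprobs_policy: log pi_theta(y_t | x, y_<t) under the policy.
        logprobs_expert: log pi_exp(y_t | x, y_<t) under the expert model.
        advantages:      per-trajectory advantages broadcast over tokens.
        is_expert:       bool mask marking tokens from expert trajectories.
    """
    # Expert (off-policy) tokens get a per-token importance weight
    # pi_theta / pi_exp, clipped for stability; on-policy tokens keep 1.
    ratio = torch.exp(logprobs_policy - logprobs_expert)
    weights = torch.where(is_expert, ratio.clamp(max=clip),
                          torch.ones_like(ratio))
    # REINFORCE-style surrogate: gradients flow only through the policy
    # log-probs, so the weights are detached.
    return -(weights.detach() * advantages * logprobs_policy).mean()
```

In a GRPO-style setup, `advantages` would be the group-normalized rewards of each trajectory broadcast over its tokens, and `is_expert` would mark the tokens contributed by the expert trajectories mixed into the group.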