Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning

Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang

TL;DR

This work investigates how visual evidence informs multimodal reasoning in RLVR by revealing a bimodal cross-modal attention structure in which roughly 15% of tokens serve as perceptual anchors. It introduces Anchor-Token Reinforcement Learning (AT-RL), a lightweight, graph-based framework that identifies anchors, clusters tokens semantically, and modulates the reinforcement signal with cluster-aware weights: for token $t$ of response $i$ falling in cluster $C_k$, $A_{AT}^{(i,t)} = W(C_k) \cdot A^{(i)}$. This focuses learning on grounding tokens while preserving linguistic coherence. Across 3B–32B Qwen2.5-VL models, AT-RL achieves consistent gains over strong baselines, with the 32B model surpassing the 72B-Instruct baseline on MathVista at 80.2, and incurs only 1.2% training overhead. The method generalizes to STEM, video, and general multimodal tasks, remains compatible with KL-free and divergence-agnostic regimes, and clarifies that effective multimodal reasoning hinges on the fidelity of cross-modal anchoring rather than token quantity.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet how visual evidence is integrated during reasoning remains poorly understood. We explore multimodal RLVR through the lens of cross-modal attention connectivity and find that only a small fraction of tokens (approximately 15%) exhibit strong visual-textual coupling. These high-connectivity tokens act as anchors that ground reasoning in the image, while the majority follow linguistic patterns. During RLVR training, credit assignment naturally concentrates on these anchors, sharpening their visual grounding over time. Building on this insight, we propose Anchor-Token Reinforcement Learning (AT-RL), a lightweight framework that selectively reinforces high-connectivity tokens via graph-based clustering of attention topology. Evaluated across the Qwen2.5-VL series (3B-32B), AT-RL introduces only 1.2% overhead yet enables the 32B model to surpass the 72B-Instruct baseline on MathVista (80.2), with consistent gains observed across STEM, video, and general tasks. Conversely, training solely on low-connectivity tokens causes severe degradation, confirming that effective multimodal RL hinges on precise credit assignment to visual anchors. Our work reveals that reasoning quality is governed not by token quantity but by the fidelity of cross-modal anchoring.
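
To make the notion of a high-connectivity token concrete, here is a minimal Python sketch. It rests on assumptions that the abstract does not confirm: connectivity density is taken to be the share of a generated token's attention mass that falls on visual positions, and anchors are simply the top 15% of tokens by that score. The function names `connectivity_density` and `select_anchors` are hypothetical, and the debiasing step described for AT-RL is omitted.

```python
from typing import List


def connectivity_density(attn_row: List[float], visual_idx: List[int]) -> float:
    """Share of one generated token's attention mass that lands on visual positions.

    Assumption: this is an illustrative proxy, not the paper's exact definition.
    """
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[j] for j in visual_idx) / total


def select_anchors(densities: List[float], fraction: float = 0.15) -> List[int]:
    """Return the indices of the top `fraction` of tokens by density (at least one)."""
    k = max(1, round(fraction * len(densities)))
    # Stable sort: ties keep their original token order.
    ranked = sorted(range(len(densities)), key=lambda t: densities[t], reverse=True)
    return sorted(ranked[:k])


# Toy example: two generated tokens attending over four context positions,
# where positions 0 and 1 are assumed to be visual tokens.
attention = [
    [0.6, 0.2, 0.1, 0.1],    # mostly looks at the image
    [0.05, 0.05, 0.5, 0.4],  # mostly looks at text
]
scores = [connectivity_density(row, visual_idx=[0, 1]) for row in attention]
anchors = select_anchors(scores)
```

In this toy case the first token carries most of the visual attention and is the one selected as an anchor; a real pipeline would aggregate attention over layers and heads before scoring.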

Paper Structure (40 sections, 21 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: (a) In multimodal CoTs, only a minority of tokens (approximately 15%) exhibit high cross-modal connectivity and act as "perceptual anchors" that ground reasoning in visual evidence, while the majority are low-connectivity tokens fulfilling linguistic structures. (b) RLVR using AT-RL to modulate reinforcement signals via cluster-based soft weighting yields substantial reasoning improvements that scale from 3B to 32B. By precisely optimizing these critical anchors, our 32B model achieves 80.2 on MathVista and 56.6 on MathVerse.
  • Figure 2: Overview of the Anchor-Token Reinforcement Learning (AT-RL) framework. Our method refines the reinforcement signal through a three-stage connectivity-aware analysis: (i) Locating Anchors: extracting cross-modal attention footprints from the MLLM and applying debiasing to identify key perceptual tokens; (ii) Token Grouping: constructing a functional dependency graph based on attention similarity and partitioning it into semantic clusters $\{C_k\}$ via the METIS algorithm; (iii) Credit Assignment: quantifying cluster-level importance $W(C_k)$ to perform soft-weighting on the base advantage $A^{(i)}$. The resulting modulated advantage $A_{AT}^{(i,t)}$ ensures that policy gradients are prioritized for tokens essential to visual grounding while preserving the logical coherence of the reasoning chain.
  • Figure 3: Cross-modal connectivity patterns in multimodal Chain-of-Thought reasoning. (a) Distribution of connectivity density across generated tokens. A minority of tokens exhibit high connectivity, while the majority have low connectivity. (b) & (c) Word clouds of the top 100 tokens with the highest and lowest average connectivity density, respectively, selected from frequently occurring tokens. A larger font size indicates higher average connectivity. Tokens with high connectivity typically serve as perceptual anchors that ground reasoning in visual evidence, whereas tokens with low connectivity primarily maintain linguistic coherence along the reasoning path.
  • Figure 4: Case study of perceptual anchor clusters in MathVision reasoning. (a) Importance ranking of token clusters based on average connectivity density. (b) Representative tokens for each cluster, illustrating the semantic difference between high-importance and low-importance clusters.
  • Figure 5: Performance and error breakdown on MathVision (304 questions): accuracy and error composition for Zero-shot (22.7%), AT-RL (30.3%), FT-RL (27.6%), and GRPO (26.0%). Right: absolute error counts by type.
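
The credit-assignment stage named in the Figure 2 caption, $A_{AT}^{(i,t)} = W(C_k) \cdot A^{(i)}$, can be illustrated with a short Python sketch. The form of $W(C_k)$ used here (a cluster's mean connectivity density divided by the global mean) is an assumption chosen for illustration and is not taken from the paper; cluster assignments are passed in directly rather than computed with METIS, and the names `cluster_weights` and `modulated_advantages` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def cluster_weights(densities: List[float], cluster_of: List[int]) -> Dict[int, float]:
    """Hypothetical W(C_k): mean density of cluster k over the global mean density."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for d, k in zip(densities, cluster_of):
        sums[k] += d
        counts[k] += 1
    global_mean = sum(densities) / len(densities)
    if global_mean == 0:
        # No visual attention anywhere: fall back to uniform weighting.
        return {k: 1.0 for k in counts}
    return {k: (sums[k] / counts[k]) / global_mean for k in counts}


def modulated_advantages(
    base_advantage: float, densities: List[float], cluster_of: List[int]
) -> List[float]:
    """Per-token advantage A_AT^(i,t) = W(C_k) * A^(i) for token t in cluster C_k."""
    weights = cluster_weights(densities, cluster_of)
    return [weights[k] * base_advantage for k in cluster_of]


# Toy response of four tokens: the first two form a high-connectivity cluster.
densities = [0.8, 0.6, 0.1, 0.1]
cluster_of = [0, 0, 1, 1]
per_token = modulated_advantages(1.0, densities, cluster_of)
```

With these toy numbers the high-connectivity cluster receives a weight of about 1.75 and the low-connectivity cluster about 0.25, so the sequence-level advantage is scaled up on anchor tokens and scaled down, but not zeroed, elsewhere. That is the sense of "soft weighting" in the caption.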