Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning
Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang
TL;DR
This work investigates how visual evidence informs multimodal reasoning in RLVR and reveals a bimodal cross-modal attention structure in which roughly $15\%$ of tokens serve as perceptual anchors. It introduces Anchor-Token Reinforcement Learning (AT-RL), a lightweight, graph-based framework that identifies anchors, clusters tokens semantically, and modulates the reinforcement signal with cluster-aware weights: a token $t$ assigned to cluster $C_k$ receives the advantage $A_{AT}^{(i,t)} = W(C_k) \cdot A^{(i)}$. This focuses learning on grounding tokens while preserving linguistic coherence. Across 3B–32B Qwen2.5-VL models, AT-RL achieves consistent gains over strong baselines at only $1.2\%$ training overhead, and the 32B model surpasses the 72B-Instruct baseline on MathVista with a score of $80.2$. The method generalizes to STEM, video, and general multimodal tasks and remains compatible with KL-free and divergence-agnostic regimes. The results indicate that effective multimodal reasoning hinges on the fidelity of cross-modal anchoring rather than on token quantity.
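The cluster-aware weighting $A_{AT}^{(i,t)} = W(C_k) \cdot A^{(i)}$ can be illustrated with a minimal sketch. It assumes that each token already carries a cluster index and that $W$ is given as a lookup table; the function name `cluster_weighted_advantages` and the weight values are hypothetical and are not taken from the paper.

```python
import numpy as np

def cluster_weighted_advantages(seq_advantage, cluster_ids, cluster_weights):
    """Return per-token advantages W(C_k) * A for one sampled response.

    seq_advantage   : scalar advantage A shared by all tokens of the response
    cluster_ids     : cluster index k for each token t (hypothetical input)
    cluster_weights : table of W(C_k), one weight per cluster (hypothetical values)
    """
    cluster_ids = np.asarray(cluster_ids, dtype=int)
    cluster_weights = np.asarray(cluster_weights, dtype=float)
    # Look up each token's cluster weight, then scale the shared advantage.
    return cluster_weights[cluster_ids] * float(seq_advantage)

# Five tokens in three clusters; cluster 1 plays the role of the anchor cluster.
token_adv = cluster_weighted_advantages(
    seq_advantage=0.5,
    cluster_ids=[0, 1, 1, 0, 2],
    cluster_weights=[0.2, 1.0, 0.6],
)
```

Under these assumed weights, tokens in the anchor cluster keep the full advantage, while tokens in the other clusters receive a scaled-down signal rather than none at all.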
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet how visual evidence is integrated during reasoning remains poorly understood. We explore multimodal RLVR through the lens of cross-modal attention connectivity and find that only a small fraction of tokens (approximately 15%) exhibit strong visual-textual coupling. These high-connectivity tokens act as anchors that ground reasoning in the image, while the majority follow linguistic patterns. During RLVR training, credit assignment naturally concentrates on these anchors, sharpening their visual grounding over time. Building on this insight, we propose Anchor-Token Reinforcement Learning (AT-RL), a lightweight framework that selectively reinforces high-connectivity tokens via graph-based clustering of attention topology. Evaluated across the Qwen2.5-VL series (3B-32B), AT-RL introduces only 1.2% overhead yet enables the 32B model to surpass the 72B-Instruct baseline on MathVista (80.2), with consistent gains across STEM, video, and general tasks. Conversely, training solely on low-connectivity tokens causes severe degradation, confirming that effective multimodal RL hinges on precise credit assignment to visual anchors. Our work reveals that reasoning quality is governed not by token quantity but by the fidelity of cross-modal anchoring.
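To make the notion of high-connectivity tokens concrete, the sketch below scores each generated token by the attention mass it places on visual positions and flags the top 15% as anchors. This is a simplified stand-in: the paper describes graph-based clustering of attention topology, whereas this example uses a plain top-fraction cutoff. The function name `select_anchor_tokens`, the input shapes, and the random attention matrix are all assumptions made for illustration.

```python
import numpy as np

def select_anchor_tokens(attn, visual_mask, anchor_frac=0.15):
    """Flag the most visually connected tokens (simplified, assumed procedure).

    attn        : (num_tokens, num_context) attention weights, rows sum to 1
    visual_mask : (num_context,) boolean, True where the context position is visual
    anchor_frac : fraction of tokens to flag as anchors (0.15 mirrors the ~15% finding)
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)
    # Connectivity score: total attention each token sends to visual positions.
    connectivity = attn[:, visual_mask].sum(axis=1)
    n_anchor = max(1, int(round(anchor_frac * attn.shape[0])))
    # Indices of the n_anchor highest-connectivity tokens.
    anchor_idx = np.argsort(-connectivity)[:n_anchor]
    is_anchor = np.zeros(attn.shape[0], dtype=bool)
    is_anchor[anchor_idx] = True
    return connectivity, is_anchor

# Synthetic attention for 20 generated tokens over 12 context positions,
# of which the first 8 are treated as visual tokens.
rng = np.random.default_rng(0)
attn = rng.random((20, 12))
attn /= attn.sum(axis=1, keepdims=True)
visual_mask = np.arange(12) < 8
connectivity, is_anchor = select_anchor_tokens(attn, visual_mask)
```

A fixed cutoff is the simplest way to pick out a small high-connectivity subset; a clustering step, as in AT-RL, would instead let the grouping follow the structure of the attention graph.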
