Boosting Reasoning in Large Multimodal Models via Activation Replay

Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan, Shijian Lu, Yu-Gang Jiang

TL;DR

RLVR enhances multimodal reasoning but its internal effects on activations are not well understood. Through logit-lens analysis, the authors show RLVR disproportionately shifts low-entropy input activations and link these shifts to reasoning performance, motivating a training-free intervention. Activation Replay introduces zero-initialized learnable tokens at test time to minimize $D_{kl}(P_{base} \| P_{rlvr})$ for low-entropy activations, aligning RLVR behavior with base-model distributions. Across math, agentic, and video reasoning, Activation Replay yields consistent improvements in Pass@K and reasoning coverage without policy optimization. The work provides both mechanistic insight into activation dynamics and a scalable, practical method to boost reasoning in post-trained LMMs.
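
The quantity in the TL;DR can be illustrated with a small sketch. It assumes the logit-lens step has already projected each input activation to vocabulary logits, for both the base and the RLVR model. It then selects the positions where the base distribution has the lowest entropy and averages $D_{kl}(P_{base} \| P_{rlvr})$ over them. The function names and the `frac` cutoff are hypothetical choices for illustration, not the authors' implementation, and the optimization of the learnable tokens is not shown.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) in q."""
    return sum(pi * (math.log(pi) - math.log(qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def low_entropy_kl(base_logits, rlvr_logits, frac=0.25):
    """Mean KL(P_base || P_rlvr) over the lowest-entropy positions.

    base_logits / rlvr_logits: one logit vector per input position,
    assumed to come from a logit-lens projection (not shown here).
    frac: hypothetical fraction of positions treated as "low entropy".
    """
    p_base = [softmax(row) for row in base_logits]
    p_rlvr = [softmax(row) for row in rlvr_logits]
    ents = [entropy(p) for p in p_base]
    k = max(1, int(frac * len(ents)))
    # Indices of the k positions with the lowest base-model entropy.
    idx = sorted(range(len(ents)), key=ents.__getitem__)[:k]
    mean_kl = sum(kl_divergence(p_base[i], p_rlvr[i]) for i in idx) / k
    return mean_kl, idx
```

In this sketch, a position whose base distribution is sharply peaked counts as low entropy; a large returned value means the RLVR model has moved away from the base model exactly at those positions.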

Abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), yet the mechanisms underlying this post-training paradigm remain poorly understood. We begin by exploring how input activations are affected by RLVR through the lens of the logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR unexpectedly shifts low-entropy activations, while high-entropy ones are less affected. Through controlled experiments, we further demonstrate that this phenomenon is associated with LMM reasoning, suggesting a potentially beneficial role for modulating low-entropy activations. To this end, we propose Activation Replay, a novel, simple yet effective training-free approach that boosts the multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design manipulates visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate their RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates the narrower reasoning coverage of RLVR. We compare our design against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, and these comparisons demonstrate the superiority of our implementation. Code will be made publicly available.
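
For readers unfamiliar with the Pass@K metric mentioned above, the following is a minimal sketch of the widely used unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, computed from $n$ sampled responses of which $c$ are correct. The abstract does not state which estimator the authors use, so this is an assumption for illustration only; the function name `pass_at_k` is hypothetical.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate from n samples with c correct ones.

    Returns the probability that at least one of k samples, drawn
    without replacement from the n generated, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k draw must contain a correct sample.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

Averaging this value over all problems in a benchmark gives the benchmark-level Pass@K; a higher value at large K is commonly read as broader reasoning coverage.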

Paper Structure

This paper contains 28 sections, 6 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Lidar Plot for Performance Gains. RLVR LMMs include VL-Rethinker, DeepEyes and Video-R1 wang2025vlrethinker, zheng2025deepeyes, feng2025videor1. Our method consistently boosts multimodal reasoning across diverse tasks in a training-free manner.
  • Figure 1: Case Study on Mathematical Reasoning.
  • Figure 2: Qualitative Case on Logit Lens. The math visual input is from wang2024mathvision. The Top-2 or Top-3 predictions (words) of input activations shift from the base model to its RLVR counterpart.
  • Figure 2: False Tool Call. Case Study of Multi-Turn o3-Like Agent zheng2025deepeyes.
  • Figure 3: How LMM Input Activations are Affected after RLVR. Within each subplot, base LMM entropy increases from left to right. The shift in KL divergence is normalized layer-wise for illustration purposes. Brighter colors indicate relatively larger shifts in KL divergence.
  • ...and 6 more figures