VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models

Soumya Suvra Ghosal, Youngeun Kim, Zhuowei Li, Ritwick Chaudhry, Linghan Xu, Hongjing Zhang, Jakub Zablocki, Yifan Xing, Qin Zhang

TL;DR

This work proposes VisRef, a visually grounded test-time scaling framework that actively guides the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning.

Abstract

Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
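
As an illustration of what selecting a coreset that is both relevant and diverse can look like, here is a minimal sketch in plain Python. It is not the authors' implementation. The overview figure on this page mentions a determinantal point process (DPP)-based criterion, so the sketch uses greedy MAP inference on a quality-times-similarity DPP kernel. The kernel form, the exponential quality score, and the names `cosine` and `select_coreset` are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def select_coreset(visual_tokens, context, k):
    """Greedily pick up to k token indices that are relevant to `context` and mutually diverse.

    Assumed kernel: L[i][j] = q_i * cos(v_i, v_j) * q_j, with quality
    q_i = exp(cos(v_i, context)). Each round adds the token with the largest
    marginal gain in det(L_S), tracked with an incremental Cholesky update.
    """
    n = len(visual_tokens)
    q = [math.exp(cosine(v, context)) for v in visual_tokens]
    L = [[q[i] * cosine(visual_tokens[i], visual_tokens[j]) * q[j]
          for j in range(n)] for i in range(n)]
    d = [L[i][i] for i in range(n)]   # d[i]: marginal determinant gain of token i
    c = [[] for _ in range(n)]        # c[i]: partial Cholesky row of token i
    selected = []
    while len(selected) < min(k, n):
        candidates = [i for i in range(n) if i not in selected]
        j = max(candidates, key=lambda i: d[i])
        if d[j] <= 1e-10:
            break  # remaining tokens are numerically redundant with the selection
        selected.append(j)
        for i in candidates:
            if i == j:
                continue
            e = (L[j][i] - sum(a * b for a, b in zip(c[j], c[i]))) / math.sqrt(d[j])
            c[i].append(e)
            d[i] -= e * e
    return selected


# Toy usage: token 1 nearly duplicates token 0, token 2 is orthogonal to both.
tokens = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
context = [1.0, 0.0]
print(select_coreset(tokens, context, 2))  # [0, 2]
```

In the toy example the most relevant token is chosen first, and the near-duplicate is then skipped in favor of the orthogonal token, which is the relevance-plus-diversity behavior the abstract describes.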

Paper Structure (21 sections, 19 equations, 7 figures, 8 tables, 1 algorithm)

Figures (7)

  • Figure 1: Illustration comparing our approach with prior test-time scaling methods. (a) Textual self-reflection-based test-time scaling (muennighoff2025s1; aggarwal2025l1) extends reasoning by encouraging the model to think longer, but progressively loses grounding in the visual input as the reasoning chain grows. (b) Training-free Visual Refocusing (ours) dynamically selects and re-injects reasoning-relevant visual cues during inference, effectively restoring visual grounding without retraining and yielding substantial accuracy gains on the MathVista and MM-Star benchmarks with InternVL3.5-8B.
  • Figure 2: Overview of VisRef. Given an image-text input, VisRef enables multi-modal large reasoning models (MLRMs) to maintain visual grounding throughout the reasoning process without retraining. At each reasoning step, the model projects visual tokens into the textual reasoning subspace and selects a subset of reasoning-relevant tokens via a determinantal point process (DPP)-based criterion. The selected tokens are then reinjected to guide subsequent reasoning. This iterative process continues until the entropy of the model's answer distribution falls below a confidence threshold $\delta_{\mathrm{entropy}}$, forming an adaptive stopping criterion. The bottom-left plot shows the visual-to-text attention ratio across reasoning steps. As shown, our method maintains a higher level of visual attention (green curve) by reinjecting a carefully selected coreset of visual tokens, whereas text-based self-reflection (red curve) exhibits a more rapid decline in visual attention, consistent with prior observations (yang2025look; chu2025qwen). Example shown uses InternVL-3.5-8B on MathVista.
  • Figure 3: Test-time scaling of VisRef. We evaluate the test-time scaling behavior of VisRef by generating multiple parallel visual-integrated reasoning chains under a fixed token budget. Results are shown across three benchmarks (MathVision, MathVista, and MM-Star) and three MLRMs: InternVL-3.5-8B (first row), Qwen-3-VL-8B (second row), and SAIL-VL2 (third row). The star marker (☆) denotes standard thinking, i.e., the baseline with no additional test-time compute. Parallel thinking (ghosal2025does; wang2022self) generates multiple parallel chains-of-thought without visual refocusing. Across all models and benchmarks, VisRef consistently achieves superior accuracy for any given computational budget.
  • Figure 4: Ablation study of hyper-parameters. (a) We visualize the ablation results of the entropy threshold $\delta_{\mathrm{entropy}}$ used as the stopping criterion. Although the accuracy does not vary significantly across different thresholds, our evaluation shows that $\delta_{\mathrm{entropy}} = 0.25$ achieves the best balance between accuracy and inference efficiency. (b) We show the ablation of the token budget $m$ (fraction of visual tokens selected) on MathVista using InternVL-3.5-8B. Accuracy improves from 76.1% to 79.2% as $m$ increases from 20% to 30% but plateaus for $m \geq 30\%$.
  • Figure 5: Attention Visualization. Attention maps show how VisRef progressively refocuses on relevant visual regions during multi-step reasoning. Initially, the attention maps are noisy. With visual reinjection, VisRef reinforces grounding on task-critical objects, leading to more accurate visual reasoning.
  • ...and 2 more figures
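
The adaptive stopping criterion described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: entropy is measured in nats over an explicit list of answer probabilities, the default threshold 0.25 is taken from the Figure 4 caption, and `answer_entropy`, `run_until_confident`, and the `step_fn` callback (standing in for one "select, reinject, reason" step that returns the model's current answer distribution) are hypothetical names.

```python
import math


def answer_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over candidate answers."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def run_until_confident(step_fn, max_steps, delta_entropy=0.25):
    """Repeat reasoning steps until the answer entropy drops below the threshold.

    step_fn(t) is a hypothetical callback: it performs refocusing step t and
    returns the answer distribution as a list of probabilities.
    Returns (steps_taken, last_distribution).
    """
    probs = None
    for t in range(1, max_steps + 1):
        probs = step_fn(t)
        if answer_entropy(probs) < delta_entropy:
            return t, probs  # confident enough: stop early
    return max_steps, probs  # budget exhausted without reaching the threshold


# Toy usage: a scripted sequence of answer distributions that sharpen over time.
scripted = [[0.5, 0.5], [0.9, 0.1], [0.97, 0.03], [1.0, 0.0]]
steps, final = run_until_confident(lambda t: scripted[t - 1], max_steps=4)
print(steps, final)  # 3 [0.97, 0.03]
```

With this choice of units, a 90/10 split over two answers (entropy about 0.33 nats) still triggers another step, while a 97/3 split (about 0.13 nats) stops the loop. Whether the paper normalizes the entropy or uses a different log base is not stated in this section.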