Unleashing Perception-Time Scaling to Multimodal Reasoning Models

Yifan Li; Zhenghao Chen; Ziheng Wu; Kun Zhou; Ruipu Luo; Can Zhang; Zhentao He; Yufei Zhan; Wayne Xin Zhao; Minghui Qiu

TL;DR

This work examines whether inference-time scaling benefits perception in LVLMs and finds only limited gains under the fast-perception paradigm. It introduces DisTANCE, a perception-centric benchmark for visual estimation, and Perception-Time Scaling (PTS), which encourages token-rich descriptions of perception and decomposes complex perception problems into intermediate sub-problems. With two-stage training (SFT followed by RL with GRPO) on synthetic PTS data, PTS achieves substantial perception gains (e.g., in high-precision $RA_{avg}$) and generalizes to out-of-domain tasks; combining PTS data with math reasoning data further improves both reasoning and real-world perception benchmarks. The study shows that modeling perception as a structured process lets it benefit from inference-time scaling, with stronger image grounding and broader multimodal improvements, offering a path toward more perception-aware LVLMs.

Abstract

Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.
