Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Hanchao Yu, Minjia Zhang, Klara Nahrstedt

TL;DR

This paper investigates whether inference-time scaling techniques that improve mathematical reasoning in LLMs transfer to vision-language models (VLMs), focusing on RL-finetuned VLMs. It contrasts generation-centric strategies (e.g., decoding-time majority voting) with verification-centric approaches (e.g., Best-of-$N$ with self-verification) on the multimodal benchmarks GeoQA and MathVista, and includes an Aha-moment search that uses GPT-4o to detect backtracking and verification behaviors. The authors report three key findings: generation-time capability matters more than verification-based methods, Aha moments are rare and do not reliably boost accuracy, and visual information is not effectively integrated into self-verification. Overall, current RL-trained VLMs gain little from self-verification in multimodal inference-time scaling, which points to the need for stronger multimodal verification mechanisms in visual mathematical reasoning.

Abstract

Inference-time techniques such as decoding-time scaling and self-refinement have been shown to substantially improve mathematical reasoning in large language models (LLMs), largely attributed to emergent self-correction and self-verification behaviors often elicited through reinforcement learning (RL). In this work, we ask whether the same recipe transfers to vision-language models (VLMs), especially RL-finetuned variants that claim strong visual mathematical reasoning. Through extensive evaluation, we reach three main findings that differ markedly from text-only models. First, generation-time capability matters more than verification and refinement: simple majority voting consistently and substantially outperforms verification-centric strategies such as best-of-N with self-verification. Second, behaviors often associated with RL-tuned models at inference time, such as the 'Aha moment,' do not yield reliable reasoning performance improvements. Third, visual information is not effectively integrated into the model's self-verification process. Overall, our analysis highlights a key limitation: current RL-trained VLMs derive limited benefit from self-verification in the visual modality, which constrains the effectiveness of inference-time scaling for visual mathematical reasoning.
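
To make the contrast between the two families of strategies concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the N sampled answers are already available as strings; the names `majority_vote`, `best_of_n`, and `verifier_score`, as well as the sample answers and scores, are hypothetical and chosen only for illustration. Majority voting selects the most frequent answer, whereas best-of-N selects the answer that a verifier (in self-verification, the model itself) scores highest.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Generation-centric selection: return the most frequent sampled answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Ties are resolved in favor of the answer that was sampled first.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str],
              verifier_score: Callable[[str], float]) -> str:
    """Verification-centric selection: return the answer the verifier rates highest."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return max(answers, key=verifier_score)


# Hypothetical data: six sampled answers and made-up verifier scores.
sampled = ["12", "15", "12", "9", "15", "12"]
scores = {"12": 0.4, "15": 0.9, "9": 0.1}

voted = majority_vote(sampled)                      # picks "12" (3 of 6 samples)
verified = best_of_n(sampled, lambda a: scores[a])  # picks "15" (highest score)
```

In this toy case the two strategies disagree: voting follows the frequency of the samples, while best-of-N is only as good as the scores the verifier assigns, which is why the quality of self-verification matters for the second family.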

Paper Structure

This paper contains 13 sections, 2 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Commercial VLM (GPT-5 series) fails to verify its counting results.