VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning

Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu

TL;DR

VOGUE tackles the exploration problem in multimodal RLVR by treating the visual input as a stochastic context and quantifying its uncertainty. It employs a dual-branch forward pass (raw vs. noisy images) and uses the symmetric KL divergence between the resulting text policies as a visual-uncertainty signal to modulate learning via an uncertainty bonus, a token-entropy bonus, and an annealed sampling schedule. Empirical results on six benchmarks across two model scales show consistent improvements in pass@1 and pass@4 over strong baselines, demonstrating enhanced robustness and reduced exploration decay. The approach is modular, practical, and can be integrated with existing policy-gradient methods, offering a principled way to fuse visual uncertainty into exploration for multimodal reasoning.

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration, an issue that still persists for multimodal LLMs (MLLMs). Current methods treat the visual input as a fixed, deterministic condition, overlooking a critical source of ambiguity and struggling to build policies robust to plausible visual variations. We introduce VOGUE (Visual Uncertainty Guided Exploration), a novel method that shifts exploration from the output (text) to the input (visual) space. By treating the image as a stochastic context, VOGUE quantifies the policy's sensitivity to visual perturbations using the symmetric KL divergence between a "raw" and "noisy" branch, creating a direct signal for uncertainty-aware exploration. This signal shapes the learning objective via an uncertainty-proportional bonus, which, combined with a token-entropy bonus and an annealed sampling schedule, effectively balances exploration and exploitation. Implemented within GRPO on two model scales (Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three visual math benchmarks and 3.7% on three general-domain reasoning benchmarks, while simultaneously increasing pass@4 performance and mitigating the exploration decay commonly observed in RL fine-tuning. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
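
To make the visual-uncertainty signal concrete, the following is a minimal sketch of a token-level symmetric KL divergence between the next-token distributions of the raw and noisy branches. It is an illustration under stated assumptions, not the authors' implementation: the function names (`kl`, `symmetric_kl`, `visual_uncertainty`), the use of the sum of both KL directions (some definitions halve it), and the averaging over tokens are all hypothetical choices.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def symmetric_kl(p, q):
    """Sum of both KL directions (assumed form; some authors use the mean)."""
    return kl(p, q) + kl(q, p)

def visual_uncertainty(raw_probs, noisy_probs):
    """Average token-level symmetric KL over one response (assumed reduction).

    raw_probs / noisy_probs: one distribution per generated token,
    from the raw-image branch and the perturbed-image branch respectively.
    """
    per_token = [symmetric_kl(p, q) for p, q in zip(raw_probs, noisy_probs)]
    return sum(per_token) / len(per_token)

# Toy example: two tokens, vocabulary of size 3.
raw = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]]
noisy = [[0.6, 0.3, 0.1], [0.5, 0.5, 0.0]]
score = visual_uncertainty(raw, noisy)
```

In this toy example only the first token's distribution shifts under the perturbation, so the score is small but positive. It would be exactly zero if the two branches agreed on every token.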

Paper Structure

This paper contains 23 sections, 5 equations, 3 figures, 11 tables, 1 algorithm.

Figures (3)

  • Figure 1: VOGUE for RL fine-tuning. Our method uses a dual-branch forward pass: the raw branch processes the original image, while the noisy branch receives a perturbed view. Token-level symmetric KL between branches provides a visual-uncertainty signal used to shape the noisy-branch advantage. An entropy bonus on both branches maintains output stochasticity, and an annealed sampling schedule balances exploration and exploitation by favoring the noisy branch early in training.
  • Figure 2: Training accuracy rewards of GRPO and VOGUE on Qwen2.5-VL 3B and 7B models. VOGUE consistently achieves higher rewards than GRPO throughout training.
  • Figure 3: Ablation studies on the effects of visual uncertainty, token entropy, sampling strategy, divergence measure, and noise level. (a) Visual uncertainty and token entropy bonuses each improve performance, and together yield the best results. (b) Annealed sampling outperforms fixed sampling, confirming the benefit of dynamically controlling the balance between exploration and exploitation. (c–d) Symmetric KL provides stable gains, while forward KL causes excessive visual uncertainty and degraded accuracy. (e–f) Moderate noise ($\sigma=0.4$) yields the best accuracy, while low noise limits exploration and high noise increases variance.
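
The annealed sampling schedule described in the Figure 1 caption favors the noisy branch early in training. One way this could look is sketched below, purely as an illustration: the linear decay, the endpoint probabilities `p_start` and `p_end`, and the function names are assumptions, since this summary does not specify the schedule's shape or values.

```python
import random

def noisy_branch_prob(step, total_steps, p_start=0.9, p_end=0.1):
    """Probability of sampling from the noisy branch at a given step.

    Hypothetical linear anneal from p_start to p_end; the actual
    schedule used by the method is not given in this summary.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return p_start + (p_end - p_start) * frac

def pick_branch(step, total_steps, rng=random):
    """Choose which branch generates the rollout for this step."""
    if rng.random() < noisy_branch_prob(step, total_steps):
        return "noisy"
    return "raw"

# Early steps mostly pick the noisy branch, late steps mostly the raw one.
early = noisy_branch_prob(0, 1000)
late = noisy_branch_prob(1000, 1000)
```

A fixed schedule corresponds to setting `p_start == p_end`, which is the kind of baseline that annealed sampling is compared against in panel (b) of Figure 3.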