Linking Perception, Confidence and Accuracy in MLLMs

Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu, Qiang Zhu

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. Further ablation studies demonstrate the effectiveness of each module and the framework's superior scaling behavior.

Paper Structure (35 sections, 12 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Disconnection Between Model Confidence and Accuracy Under Perception Degradation. The X-axis ('Perception') shows the input image with progressively increasing noise. The plot demonstrates that while the Mean Confidence remains highly stable (insensitive to the noise), the Model Accuracy drops sharply, revealing a significant gap between the model's self-reported certainty and its actual performance as the visual input degrades.
  • Figure 2: Framework Overview. The upper panel (a) illustrates how original-noise image pairs are used to optimize the model via Reinforcement Learning, driven by a Confidence-based Calibration Reward and an Accuracy Reward. The bottom panel (b) shows the adaptive Confidence-Aware Test-Time Scaling (CA-TTS) system, where an Expert Model acts as a Planner, Voter, and Critic to coordinate the Self-Consistency, Self-Reflection, and Self-Check modules, which collaborate to produce the final answer.
  • Figure 3: Test-time scaling comparison on Math-Vision. Accuracy vs. number of samples for our CA-TTS (blue), Majority Voting (green), and DeepConf (yellow). The slope of our method ($\beta_1 = 3.65$) is 2.2-3.1$\times$ steeper than baselines ($\beta_2 = 1.64$, $\beta_3 = 1.19$), demonstrating superior scaling potential with calibrated confidence.
  • Figure 4: A case study comparing the reasoning processes of ToT (Yao et al., 2023) and CA-TTS (Ours). ToT (upper) conducts a complex tree search that remains vulnerable to a single point of failure in its final evaluation, leading it to the incorrect answer. In contrast, our method (bottom) demonstrates a multi-stage, resilient process: an initial error from Self-Consistency (Answer: 4) is corrected by Self-Reflection (Answer: 6) and confirmed by Self-Check.
  • Figure 5: The prompt template used for the Voter Expert. The model acts as a discriminator to assign probability scores to candidate choices, facilitating confidence-weighted voting.
  • ...and 7 more figures
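
The confidence-weighted voting mentioned in the Figure 5 caption can be contrasted with plain majority voting in a minimal sketch. The aggregation rule used here (summing per-sample confidence for each candidate answer) is an assumption made for illustration, since this section does not specify how the Voter's probability scores are combined. The function names, the sampled answers, and the confidence values are all hypothetical and are not results from the paper.

```python
from collections import defaultdict


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)


def confidence_weighted_vote(answers, confidences):
    """Return the answer with the largest summed confidence.

    Assumption: each sampled answer carries one confidence score in [0, 1],
    and scores for identical answers are simply added together.
    """
    if len(answers) != len(confidences):
        raise ValueError("answers and confidences must have the same length")
    scores = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Hypothetical samples: three low-confidence votes for "4",
# two high-confidence votes for "6".
answers = ["4", "4", "4", "6", "6"]
confidences = [0.30, 0.35, 0.25, 0.90, 0.85]

majority_answer = majority_vote(answers)
weighted_answer = confidence_weighted_vote(answers, confidences)
print("majority vote:", majority_answer)
print("confidence-weighted vote:", weighted_answer)
```

In this toy input the two rules disagree: majority voting picks the answer with more samples, while the weighted rule picks the answer backed by higher confidence. The weighted rule only helps when confidence tracks correctness, which is why calibration matters for this style of test-time scaling.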