Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs

Sanghwan Kim, Rui Xiao, Stephan Alaniz, Yongqin Xian, Zeynep Akata

TL;DR

This work proposes an effective, training-free framework that uses an MLLM's intrinsic uncertainty as a proactive guidance signal, and validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.

Abstract

Multimodal Large Language Models (MLLMs) often struggle with fine-grained perception, such as identifying small objects in high-resolution images or finding key moments in long videos. Existing works typically rely on complicated, task-specific fine-tuning, which limits their generalizability and increases model complexity. In this work, we propose an effective, training-free framework that uses an MLLM's intrinsic uncertainty as a proactive guidance signal. Our core insight is that a model's output entropy decreases when presented with relevant visual information. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most salient data. We apply this simple principle to three complex visual tasks: Visual Search, Long Video Understanding, and Temporal Grounding, allowing off-the-shelf MLLMs to achieve performance competitive with specialized, fine-tuned methods. Our work validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.
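The scoring idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes each candidate visual input (an image crop or a video frame) has already been passed through an MLLM and that the per-token output logits are available as plain lists. The function names, the toy logits, and the choice to average entropy over response tokens are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def response_entropy(per_token_logits):
    """Mean token entropy over a response (averaging is an assumption)."""
    return sum(token_entropy(l) for l in per_token_logits) / len(per_token_logits)

def select_lowest_entropy(candidates):
    """Score every candidate and return the least uncertain one.

    `candidates` maps a hypothetical candidate name to the per-token
    logits the model produced when shown that candidate.
    """
    scores = {name: response_entropy(l) for name, l in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Toy logits over a 3-token vocabulary, two response tokens per candidate.
toy_candidates = {
    "crop_a": [[2.0, 1.9, 2.1], [0.5, 0.4, 0.6]],   # near-uniform: uncertain
    "crop_b": [[8.0, 0.1, 0.2], [7.5, 0.3, 0.1]],   # peaked: confident
}
best, scores = select_lowest_entropy(toy_candidates)
print(best, scores)
```

In this toy setup the peaked distribution of `crop_b` gives it the lower score, so it would be the input forwarded to the final answering pass.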

Paper Structure

This paper contains 27 sections, 2 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: Empirical validation of our core hypothesis on $V^*$ Bench. As the visual input is increasingly focused on the target object (higher zoom-in ratio), the MLLM's output entropy (red line, right axis) decreases while task accuracy (blue line, left axis) consistently increases. This demonstrates a strong inverse correlation, motivating entropy minimization as a guidance signal.
  • Figure 2: An overview of our Uncertainty-Guided (UG) Framework. The framework follows a two-stage process: (1) Scoring Stage: Candidate visual inputs (image crops or video frames) are scored using the MLLM's intrinsic uncertainty, measured by either Token Entropy or Binary Response Confidence (BRC) score. (2) Answering Stage: The inputs with the lowest uncertainty are used for a final inference to generate the definitive answer.
  • Figure 3: (a) Ablation on Input Granularity: Performance as a function of visual crop size (for UG-Search) and frame window size (for UG-Sample and UG-Ground). (b) Scaling Properties: Performance of the baseline and our UG-enhanced models across the InternVL-2.5 family.
  • Figure 4: Qualitative Results. (a) UG-Search localizes small target objects by selecting the lowest-entropy crop (red box). (b) UG-Sample identifies key semantic frames (red box) with the lowest entropy from a long video. (c) UG-Ground pinpoints the correct event timeline by finding the peak in its BRC score sequence.
  • Figure 5: (a) Correlation between sub-task accuracy and entropy. (b) Entropy distribution for correct and incorrect predictions.
  • ...and 3 more figures
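The captions of Figures 2 and 4(c) mention a Binary Response Confidence (BRC) score whose peak marks the grounded event, but this page does not define it. The sketch below is therefore a guess at the general shape of such a procedure, not the paper's method: it assumes the model is asked a yes/no question per frame window, that BRC is the softmax probability of "yes" over the two answer logits, and that the event is the contiguous run around the peak whose scores stay above a fraction of the peak value. The function names, the `rel_threshold` parameter, and the toy logits are all hypothetical.

```python
import math

def binary_confidence(yes_logit, no_logit):
    """Assumed BRC: probability of 'yes' under a two-way softmax."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))

def peak_segment(scores, rel_threshold=0.5):
    """Return (start, end) indices of the run around the highest score.

    The run is grown outward from the peak while neighbouring scores
    stay at or above rel_threshold * peak (a hypothetical rule).
    """
    peak = max(range(len(scores)), key=scores.__getitem__)
    cutoff = scores[peak] * rel_threshold
    start = peak
    while start > 0 and scores[start - 1] >= cutoff:
        start -= 1
    end = peak
    while end < len(scores) - 1 and scores[end + 1] >= cutoff:
        end += 1
    return start, end

# Toy (yes_logit, no_logit) pairs for six consecutive frame windows.
toy_logits = [(-2.0, 2.0), (-1.0, 1.0), (2.0, -1.0),
              (3.0, -2.0), (1.0, 0.0), (-2.0, 2.0)]
brc = [binary_confidence(y, n) for y, n in toy_logits]
segment = peak_segment(brc)
print(segment)
```

Growing the segment outward from the peak keeps the result to a single contiguous interval, which matches the idea of one event timeline; other rules, such as a fixed absolute threshold, would be equally plausible under these assumptions.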