Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework

Zhuo Zhi, Chen Feng, Adam Daneshmend, Mine Orlu, Andreas Demosthenous, Lu Yin, Da Li, Ziquan Liu, Miguel R. D. Rodrigues

TL;DR

The paper addresses the challenge of reliable multimodal reasoning with limited training and annotation by introducing SRICE, a training-free framework that couples external vision tools with uncertainty quantification. It employs conformal prediction to calibrate tool outputs and a prediction-set-inspired metric to quantify MLLM reasoning uncertainty, enabling autonomous ROI selection and robust, multi-stage reasoning. The approach yields average gains of about 4.6% across five diverse VQA-like datasets, with certain cases even surpassing fine-tuned baselines, underscoring the value of reliable tool use in MLLMs. This work advances practical deployment of multimodal reasoning by reducing data requirements while improving robustness to tool and model uncertainty, and it opens avenues for extending uncertainty-aware agentic reasoning to additional modalities.

Abstract

Multimodal large language models (MLLMs) show promise in tasks like visual question answering (VQA) but still face challenges in multimodal reasoning. Recent works adapt agentic frameworks or chain-of-thought (CoT) reasoning to improve performance. However, CoT-based multimodal reasoning often demands costly data annotation and fine-tuning, while agentic approaches that rely on external tools risk introducing unreliable output from these tools. In this paper, we propose Seeing and Reasoning with Confidence (SRICE), a training-free multimodal reasoning framework that integrates external vision models with uncertainty quantification (UQ) into an MLLM to address these challenges. Specifically, SRICE guides the inference process by allowing the MLLM to autonomously select regions of interest through multi-stage interactions with external tools. We use a conformal prediction-based approach to calibrate the output of external tools, and we select the optimal tool by estimating the uncertainty of the MLLM's output. Our experiments show that SRICE improves over the base MLLM by 4.6% on average across five datasets, and on some datasets it even outperforms fine-tuning-based methods, revealing the significance of ensuring reliable tool use in an MLLM agent.
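To make the calibration idea concrete, here is a minimal sketch of generic split conformal prediction for a classifier-style tool output. It is not the SRICE implementation: the abstract only states that a conformal prediction-based approach calibrates tool outputs, so the nonconformity score (one minus the probability of the true class), the function names `conformal_threshold` and `prediction_set`, and the toy numbers are all assumptions chosen for illustration.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from a held-out calibration set (illustrative).

    cal_probs: (n, num_classes) predicted probabilities from the tool.
    cal_labels: (n,) integer ground-truth classes.
    alpha: target miscoverage rate (0.1 aims for roughly 90% coverage).
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # Assumed nonconformity score: 1 - probability given to the true class.
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    # Finite-sample corrected rank (1-indexed) of the quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points: fall back to including every class.
        return 1.0
    return float(scores[k - 1])

def prediction_set(probs, q_hat):
    """Indices of classes whose nonconformity score is within the threshold."""
    probs = np.asarray(probs, dtype=float)
    return np.flatnonzero(1.0 - probs <= q_hat).tolist()

# Toy calibration data (made up for this sketch).
cal_probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 0]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
calibrated = prediction_set([0.7, 0.25, 0.05], q_hat)
```

A larger prediction set signals that the tool is less certain about that output, which is the property an uncertainty-aware agent can exploit when deciding how much to trust a tool.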

Paper Structure

This paper contains 15 sections, 13 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of different multimodal reasoning methods on an example image. US refers to the uncertainty score. (a) Zero-shot VQA by the MLLM. (b) The MLLM calls an object detector for additional visual information. (c) Our proposed SRICE framework. The MLLM alone fails in zero-shot mode due to its limited vision recognition capacity. When relying on external models, the output can be unreliable and lead to an incorrect answer. In contrast, SRICE calibrates the outputs from external tools to ensure reliable visual information and estimates the uncertainty of MLLM outputs to select the most reliable tool, yielding the correct answer.
  • Figure 2: The proposed SRICE framework. The MLLM gives the wrong answer 'The beer is on the ground.' for this example. Instead, SRICE generates the correct answer through a two-stage process. Stage 1: SRICE calls external tools to obtain fine-grained information and applies CP-based calibration to their outputs to improve quality. This calibration mitigates issues such as the segmentation tool misclassifying many pixels as background and the object detection tool missing small objects. Based on the calibrated results, the MLLM selects regions of interest through a CoT process. Stage 2: The key area identified in Stage 1 is extracted and combined with the original image as the MLLM input to perform CoT reasoning. The best answer is chosen from all agentic pathways using our uncertainty estimation based on the prediction set size.
  • Figure 3: Visualization of some results. GT refers to the ground-truth answer. Due to limited space, we do not show the interaction with the MLLM and focus instead on the UQ process in Stage 1 and Stage 2. Refer to Fig. 2 for more details of the framework.
  • Figure 4: Reliability diagram of SRICE-seg-CP, SRICE-det-CP and SRICE on the GQA dataset.
  • Figure 5: Visualization of more results.
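
The Figure 2 caption says the best answer is chosen from all agentic pathways using an uncertainty estimate based on prediction set size. The sketch below illustrates that selection step under loud assumptions: the exact SRICE metric is not given in this summary, so the cumulative-mass rule in `set_size`, the `coverage` parameter, the `Pathway` type, and the option labels "A", "B", "C" with their probabilities are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Pathway:
    """One agentic pathway: the tool used and the MLLM's answer probabilities."""
    tool: str
    option_probs: Dict[str, float]

def set_size(option_probs: Dict[str, float], coverage: float = 0.9) -> int:
    """Number of top-ranked options needed to reach the coverage mass.

    Assumed stand-in for a prediction-set-size uncertainty score:
    a smaller set means the MLLM is more certain on this pathway.
    """
    total = 0.0
    ranked = sorted(option_probs.values(), reverse=True)
    for size, p in enumerate(ranked, start=1):
        total += p
        if total >= coverage:
            return size
    return len(option_probs)

def select_answer(pathways: List[Pathway], coverage: float = 0.9) -> Tuple[str, str, int]:
    """Pick the pathway with the smallest set; break ties by top probability."""
    def key(p: Pathway):
        return (set_size(p.option_probs, coverage), -max(p.option_probs.values()))
    best = min(pathways, key=key)
    answer = max(best.option_probs, key=best.option_probs.get)
    return best.tool, answer, set_size(best.option_probs, coverage)

# Hypothetical pathways: one using a segmentation tool, one an object detector.
pathways = [
    Pathway("seg", {"A": 0.55, "B": 0.40, "C": 0.05}),
    Pathway("det", {"A": 0.92, "B": 0.05, "C": 0.03}),
]
tool, answer, score = select_answer(pathways)
```

Here the detector pathway would be preferred because its probability mass concentrates on a single option, giving a set of size one, while the segmentation pathway needs two options to reach the same coverage.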