Taming Object Hallucinations with Verified Atomic Confidence Estimation

Jiarui Liu, Weihao Xuan, Zhijing Jin, Mona Diab

TL;DR

TACO is a lightweight, four-stage framework for curbing object hallucinations in multimodal language models. It decomposes the query and the model's answer into atomic binary checks, paraphrases each check to reduce sensitivity to wording, estimates confidence via self-consistency (black-box) or self-confidence (gray-box) aggregation over the paraphrase responses, and finally refines the answer with an LLM. Because verification is performed by the MLLM itself, no external vision experts are required. Across five benchmarks (POPE, MME, HallusionBench, AMBER, MM-Hal Bench) and two MLLMs (LLaVA-1.5-7B and CogVLM2), TACO consistently reduces hallucinations and improves confidence calibration, with self-confidence generally outperforming self-consistency. The work also analyzes bias reduction, the impact of query reformulations, and limitations in handling negative questions.

Abstract

Multimodal Large Language Models (MLLMs) often suffer from hallucinations, particularly errors in object existence, attributes, or relations, which undermine their reliability. We introduce TACO (Verified Atomic Confidence Estimation), a simple framework that mitigates hallucinations through self-verification and confidence calibration without relying on external vision experts. TACO decomposes responses into atomic queries, paraphrases them to reduce sensitivity to wording, and estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, before refining answers with a language model. Experiments on five benchmarks (POPE, MME, HallusionBench, AMBER, and MM-Hal Bench) with two MLLMs (LLaVA-1.5-7B and CogVLM2) show that TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration, demonstrating its effectiveness in enhancing the faithfulness of MLLMs.
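To make the two confidence modes concrete, here is a minimal Python sketch under stated assumptions. The function names, the yes/no string format, and the use of a plain mean for the gray-box aggregation are illustrative choices, not the paper's implementation. The black-box variant only sees the text answers to each paraphrase, while the gray-box variant assumes access to the model's probability of answering "yes".

```python
from collections import Counter
from statistics import mean


def self_consistency(answers):
    """Black-box: majority vote over yes/no answers to the paraphrases.

    Returns the winning label and the fraction of answers that agree with it.
    """
    counts = Counter(a.strip().lower() for a in answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def self_confidence(p_yes):
    """Gray-box: aggregate per-paraphrase P("yes") values.

    Averaging is an assumption made for this sketch; other aggregation
    functions could be substituted.
    """
    avg = mean(p_yes)
    label = "yes" if avg >= 0.5 else "no"
    # Confidence is the aggregated probability of the chosen label.
    return label, max(avg, 1.0 - avg)
```

For example, three "yes" answers out of four paraphrases give the label "yes" with confidence 0.75 in the black-box mode, whereas the gray-box mode can still return "no" if the averaged probability of "yes" falls below 0.5.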

Paper Structure

This paper contains 36 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Illustration of the TACO pipeline using a generative example across four steps. First, atomic facts are extracted from the query and the original answer, and each fact is framed as a binary atomic query. Second, each atomic query is reformulated into multiple semantically equivalent variations to mitigate the over-sensitivity of MLLMs to surface text. Third, the MLLM’s responses to these queries are aggregated, and confidence is estimated using either self-consistency (black-box) or self-confidence (gray-box) to select the more reliable answer. Finally, an LLM refines the MLLM’s initial response by incorporating the corrected atomic answers.
  • Figure 2: LLaVA-1.5-7B on MME.
  • Figure 3: CogVLM2 on MME.
  • Figure 5: Comparison of aggregation functions for self-confidence estimation on POPE using .
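The four steps described in the Figure 1 caption can be sketched as a small orchestration function. This is only an illustration: every callable passed in (extract_atomic_queries, paraphrase, ask_mllm, refine) is a hypothetical placeholder for a model call, and the black-box majority vote is used as the confidence estimate for simplicity.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def taco_pipeline(
    query: str,
    initial_answer: str,
    extract_atomic_queries: Callable[[str, str], List[str]],  # hypothetical
    paraphrase: Callable[[str], List[str]],                   # hypothetical
    ask_mllm: Callable[[str], str],                           # hypothetical
    refine: Callable[[str, Dict[str, Tuple[str, float]]], str],  # hypothetical
) -> str:
    verified: Dict[str, Tuple[str, float]] = {}
    # Step 1: frame each atomic fact as a binary (yes/no) query.
    for atomic in extract_atomic_queries(query, initial_answer):
        # Step 2: add semantically equivalent reformulations.
        variants = [atomic] + paraphrase(atomic)
        # Step 3: aggregate the answers; agreement rate acts as confidence.
        votes = Counter(ask_mllm(v).strip().lower() for v in variants)
        answer, count = votes.most_common(1)[0]
        verified[atomic] = (answer, count / len(variants))
    # Step 4: an LLM rewrites the initial answer using the verified facts.
    return refine(initial_answer, verified)
```

Passing the model calls in as arguments keeps the sketch independent of any particular MLLM or LLM API, so the stubs can be replaced by real clients without changing the control flow.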