Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E. Gonzalez, Trevor Darrell, David M. Chan

TL;DR

This work tackles visual hallucinations in Vision-Language Models by introducing REVERSE, a unified framework that couples hallucination-aware training with retrospective verification and self-correction during decoding. It features explicit confidence tokens and a retrospective resampling procedure that backtracks, rewrites queries, and applies rejection sampling to refine outputs on the fly. A 1.3M semi-synthetic instruction-tuning dataset with 6.8M QA turns supports training, enabling the model to mark confident versus unconfident content and to correct itself iteratively. Empirically, REVERSE delivers state-of-the-art hallucination reduction on CHAIR-MSCOCO and HaloQuest benchmarks (up to 12% and 34% improvements, respectively) while maintaining competitive general VLM performance, and it provides a tunable balance between creativity and grounding through a threshold parameter $\tau$.

Abstract

Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. While effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise those hallucinations. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 34% on HaloQuest. Our dataset, model, and code are available at: https://reverse-vlm.github.io.

Paper Structure

This paper contains 36 sections, 5 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: REVERSE, our proposed training and decoding paradigm for hallucination reduction, enables a single VLM to both verify whether it has generated a hallucination and then correct itself iteratively. When uncertainty is detected through the generation of a </UN> token, the model backtracks and regenerates until a confident phrase (</CN>) is found.
  • Figure 2: Our 1.3M semi-synthetic instruction-tuning dataset for hallucination-aware VLM training. We constructed the dataset by augmenting negative phrases from the original LLaVA-v1.5-665k dataset (Liu et al., 2024). Our negative phrases span a diverse range, including attributes, objects, world entities, and novel scenes. Positive noun phrases are marked with <SPAN> and </CN>, while negative samples are enclosed with <SPAN> and </UN>, terminating immediately. Further details about our dataset creation and statistics can be found in the data-creation subsection and the data appendix.
  • Figure 3: Illustration of our retrospective resampling process. During inference, we monitor the hallucination-aware VLM’s generation. When the likelihood of the </UN> token surpasses a predefined threshold, we trigger backtracking to the most recent confident checkpoint (</CN>) and apply corrections using rejection sampling and query rewriting. This self-correction mechanism can be applied iteratively throughout the generation process.
  • Figure 4: Qualitative examples of different methods. When generating captions for an image, LLaVA, OPERA, and Woodpecker tend to hallucinate nonexistent objects. REVERSE generates correct captions of similar length. Additional qualitative results are provided in the appendix.
  • Figure 5: This plot illustrates the trade-off between CHAIR $(\downarrow)$ and Coverage $(\uparrow)$ across different threshold values. REVERSE is the first controllable VLM allowing for such trade-offs.
  • ...and 5 more figures
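
The decoding loop described in the Figure 3 caption can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: `Span`, `toy_propose`, `retrospective_resample`, the retry budget, and the stop-on-exhaustion fallback are all hypothetical names and choices. The `p_unconfident` field stands in for the likelihood of the </UN> token, the list of accepted spans stands in for the most recent confident checkpoint (</CN>), and a boolean hint stands in for query rewriting. The toy proposer is deterministic so the behavior is easy to follow; a real VLM would sample.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Span:
    text: str
    p_unconfident: float  # stand-in for the likelihood of the </UN> token


# Hypothetical proposer interface: (accepted spans, hinted?) -> next span,
# or None when the caption is finished.
Proposer = Callable[[List[Span], bool], Optional[Span]]


def retrospective_resample(propose: Proposer, tau: float,
                           max_retries: int = 5,
                           max_spans: int = 20) -> List[Span]:
    """Accept spans one at a time, rejecting any whose </UN> score exceeds tau."""
    accepted: List[Span] = []  # everything here ends at a confident checkpoint
    while len(accepted) < max_spans:
        chosen = None
        for attempt in range(max_retries + 1):
            # After the first rejection, pass a hint (toy "query rewriting").
            candidate = propose(accepted, attempt > 0)
            if candidate is None:
                return accepted  # generation finished normally
            if candidate.p_unconfident <= tau:
                chosen = candidate
                break
            # Otherwise: discard the candidate. Because `accepted` is
            # unchanged, the next attempt restarts from the last checkpoint.
        if chosen is None:
            break  # assumption: stop when the retry budget is exhausted
        accepted.append(chosen)
    return accepted


def toy_propose(accepted: List[Span], hinted: bool) -> Optional[Span]:
    """Scripted stand-in for a hallucination-aware VLM (purely illustrative)."""
    script = [
        [Span("a dog", 0.05)],
        [Span("a frisbee", 0.90), Span("a ball", 0.10)],
        [Span("on grass", 0.20)],
    ]
    position = len(accepted)
    if position >= len(script):
        return None
    options = script[position]
    if hinted:
        # With a hint, the toy model prefers its least uncertain option.
        return min(options, key=lambda s: s.p_unconfident)
    return options[0]


if __name__ == "__main__":
    for tau in (0.95, 0.5, 0.01):
        spans = retrospective_resample(toy_propose, tau)
        print(tau, [s.text for s in spans])
```

In this toy setup a permissive threshold keeps the uncertain phrase, a moderate one replaces it after backtracking, and a very strict one accepts nothing, which mirrors in miniature the trade-off that the Figure 5 caption attributes to the threshold.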