A Unified Hallucination Mitigation Framework for Large Vision-Language Models

Yue Chang, Liqiang Jing, Xiaopeng Zhang, Yue Zhang

TL;DR

Dentist is a unified framework that classifies each query as perception or reasoning and then applies a matching hallucination-mitigation procedure to the answer. On MMBench, it improves accuracy on Image Quality, a Coarse Perception visual question answering (VQA) task, by 13.44%/10.2%/15.8% over the InstructBLIP/LLaVA/VisualGLM baselines.

Abstract

Hallucination is a common problem for Large Vision-Language Models (LVLMs) with long generations, and it is difficult to eradicate. A hallucinated generation is partially inconsistent with the image content. To mitigate hallucination, current studies focus either on the model's inference process or on its generated results, but the resulting solutions do not always handle the various types of queries, and the hallucinations in the answers to those queries, appropriately. To handle these different hallucinations accurately, we present Dentist, a unified framework for hallucination mitigation. Its core step is to classify the query first and then apply a mitigation procedure chosen according to the classification result, just as a dentist first examines the teeth and then makes a plan. In a simple deployment, Dentist classifies queries as perception or reasoning and mitigates potential hallucinations in the answers, as our experiments demonstrate. On MMBench, we achieve a 13.44%/10.2%/15.8% improvement in accuracy on Image Quality, a Coarse Perception visual question answering (VQA) task, over the baseline InstructBLIP/LLaVA/VisualGLM.
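
The classify-then-mitigate control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword classifier is a toy stand-in for whatever classifier the framework actually uses, and the function names, the reviser callables, the `max_rounds` cap, and the "stop when the answer no longer changes" rule are all hypothetical assumptions made for illustration.

```python
import re
from typing import Callable, Tuple

# A reviser takes (query, answer) and returns a possibly corrected answer.
Reviser = Callable[[str, str], str]

# Toy cue words; an assumption for illustration, not taken from the paper.
REASONING_CUES = {"why", "how", "explain", "infer"}


def classify_query(query: str) -> str:
    """Label a query as 'perception' or 'reasoning' with a toy keyword rule."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return "reasoning" if words & REASONING_CUES else "perception"


def mitigate(
    query: str,
    answer: str,
    revise_perception: Reviser,
    revise_reasoning: Reviser,
    max_rounds: int = 3,  # hypothetical cap on revision rounds
) -> Tuple[str, str]:
    """Classify the query, then repeatedly revise the answer with the
    reviser chosen for that query type until the answer stops changing."""
    kind = classify_query(query)
    revise = revise_perception if kind == "perception" else revise_reasoning
    for _ in range(max_rounds):
        revised = revise(query, answer)
        if revised == answer:  # assumed stopping rule: no further change
            break
        answer = revised
    return kind, answer


if __name__ == "__main__":
    # Stub revisers stand in for real image-grounded checks.
    fix_color: Reviser = lambda q, a: "The car is red."
    keep: Reviser = lambda q, a: a
    print(mitigate("What color is the car?", "The car is blue.", fix_color, keep))
```

In this sketch the two revisers are injected as callables, so the dispatch logic can be exercised with stubs before any model is attached.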

Paper Structure (34 sections, 9 equations, 15 figures, 12 tables, 1 algorithm)

Figures (15)

  • Figure 1: An example of hallucination. The model's generation is partially inconsistent with the image; we distinguish perception hallucination and reasoning hallucination.
  • Figure 2: An overview of the proposed method. The components using GPT are indicated in orange. The icons of open and closed eyes indicate whether the component is a pure text task or is related to an image. The black line represents the original part of LVLM. The blue line represents the forward path of the verification process, and the orange line represents the feedback path in the verification loop. The core point is to customize different methods of mitigating hallucinations by classifying the query. The reasoning section is used to mitigate the hallucinations caused by reasoning queries, while the perception section is used to mitigate the hallucinations caused by perception queries.
  • Figure 3: Results of verification loop
  • Figure 4: Two test cases of Dentist. The left example shows a perception hallucination produced by the LVLM, and the right example shows a reasoning hallucination produced by the LVLM; Dentist eliminates both.
  • Figure 5: Prompt template for classification
  • ...and 10 more figures