V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Yuxi Xie, Guanzhen Li, Xiao Xu, Min-Yen Kan

TL;DR

This work proposes Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time. The analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context.

Abstract

Large vision-language models (LVLMs) suffer from hallucination, resulting in misalignment between the output textual response and the input visual content. Recent research indicates that the over-reliance on the Large Language Model (LLM) backbone, as one cause of the LVLM hallucination, inherently introduces bias from language priors, leading to insufficient context attention to the visual inputs. We tackle this issue of hallucination by mitigating such over-reliance through preference learning. We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time. To interpret the effectiveness and generalizability of V-DPO on different types of training data, we construct a synthetic dataset containing both response- and image-contrast preference pairs, compared against existing human-annotated hallucination samples. Our approach achieves significant improvements compared with baseline methods across various hallucination benchmarks. Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context. Our code is publicly available at https://github.com/YuxiXie/V-DPO.
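
As background for the preference-learning setup described above, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the V-DPO objective, whose vision-guided term is not given on this page. The function name, the argument names, and the default `beta=0.1` are illustrative assumptions. For a response-contrast pair, the two log-probabilities would score a preferred and a dispreferred response to the same image; for an image-contrast pair, they would presumably score the same response under a matching and a manipulated image.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (sketch, not V-DPO itself).

    Each argument is a sequence log-probability (a float). `beta` scales
    the implicit reward; its default here is an arbitrary illustrative value.
    """
    # Implicit rewards: log-ratio of the policy to the frozen reference model.
    reward_chosen = policy_logp_chosen - ref_logp_chosen
    reward_rejected = policy_logp_rejected - ref_logp_rejected
    margin = beta * (reward_chosen - reward_rejected)
    # -log(sigmoid(margin)) equals softplus(-margin), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities, chosen only to exercise the function.
loss_neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and the reference agree, the margin is zero and the loss equals $\log 2$. The loss falls below that value once the policy favours the preferred sample more strongly than the reference does.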

Paper Structure

This paper contains 37 sections, 1 theorem, 19 equations, 9 figures, 8 tables.

Key Result

Proposition 1

There exists $M < \infty$ such that, for any $y\sim\pi(y\mid v,x)$, the ratio $\frac{\pi(y\mid x)}{\pi(y\mid v,x)}$ is bounded by $M$.
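
To make the statement of Proposition 1 concrete, the sketch below computes the smallest valid bound $M$ for a toy discrete case. The two distributions and the helper name `ratio_bound` are hypothetical and not taken from the paper. With a finite set of responses the bound always exists, so this only illustrates what the ratio measures: how much more probable a response is without the image than with it.

```python
def ratio_bound(p_text_only, p_with_image):
    """Largest value of pi(y|x) / pi(y|v,x) over the support of pi(y|v,x).

    Both arguments map a response y to its probability. Responses with
    zero probability under pi(y|v,x) are skipped, because the proposition
    only quantifies over y drawn from that distribution.
    """
    ratios = [
        p_text_only.get(y, 0.0) / p_vx
        for y, p_vx in p_with_image.items()
        if p_vx > 0.0
    ]
    return max(ratios)


# Toy distributions over three candidate responses (made-up numbers).
p_with_image = {"cat": 0.7, "dog": 0.2, "bird": 0.1}   # pi(y | v, x)
p_text_only = {"cat": 0.4, "dog": 0.4, "bird": 0.2}    # pi(y | x)

M = ratio_bound(p_text_only, p_with_image)
```

In this toy case the ratio is largest for "dog" and "bird", the responses that the language prior favours more than the image-conditioned distribution does.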

Figures (9)

  • Figure 1: (a) Hallucination examples in visual question answering and region descriptions and (b) the model discriminative ability on the accurate and hallucinatory samples represented by difference in log-likelihoods.
  • Figure 2: Outline of our preference data construction and vision-guided preference learning framework. In the stage of Synthetic Data Augmentation, we utilize LVLMs, LLMs, and Stable Diffusion to manipulate images automatically. We formulate the generated samples into image- and response-contrast pairs for preference learning via our Vision-guided DPO approach.
  • Figure 3: Meso-analysis on MMHal-Bench results comparing performance in different splits of question types.
  • Figure 4: MMHal-Bench results on hallucination rate (Hal) and overall GPT-4 score.
  • Figure 4: Performance curves (CHAIR$_{\downarrow}$ and F1$_{\uparrow}$) on AMBER with the change of the visual guidance weight $\gamma$.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Proposition 1