CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

TL;DR

LVLMs suffer from hallucinations that limit real-world deployment. CLIP-DPO tackles this by ranking self-generated captions with a pre-trained CLIP model to form positive/negative pairs for Direct Preference Optimization (DPO), without paid APIs or external LVLMs, and with robust rule-based filtering. The method is evaluated on MobileVLM-v2 and LLaVA-1.5 7B, showing significant hallucination reduction and improved zero-shot grounding while preserving performance on standard LVLM benchmarks; it also outperforms HA-DPO and large-data baselines on AMBER. The approach is lightweight and scalable, with potential extensions to other VL benchmarks and scorer choices. The resulting pairs are used in the DPO objective $\max_{\pi_\theta} \mathbb{E}_{(x_i,y_i^+,y_i^-)\sim\mathcal{D}} \log \sigma\left(\beta \log \frac{\pi_{\theta}(y_i^+ \mid x_i)}{\pi_{\mathrm{ref}}(y_i^+ \mid x_i)} - \beta \log \frac{\pi_{\theta}(y_i^- \mid x_i)}{\pi_{\mathrm{ref}}(y_i^- \mid x_i)}\right)$, where $x_i$ is the input, $y_i^+$ and $y_i^-$ are the positive and negative responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ is a scaling hyperparameter.
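As a rough illustration of the objective above, the sketch below evaluates it for a small batch, assuming the sequence log-probabilities under the policy and the reference model have already been computed. The function names and the default `beta=0.1` are illustrative assumptions, not values taken from the paper.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_objective(
    policy_logp_pos: Sequence[float],
    policy_logp_neg: Sequence[float],
    ref_logp_pos: Sequence[float],
    ref_logp_neg: Sequence[float],
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Batch mean of log sigma(beta * [(log-ratio of y+) - (log-ratio of y-)]).

    Each input holds one log-probability per (x_i, y_i^+, y_i^-) triple.
    The objective is maximized; a training loss would be its negative.
    """
    total = 0.0
    for pp, pn, rp, rn in zip(
        policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg
    ):
        # log(pi/pi_ref) for the positive minus the same for the negative
        margin = beta * ((pp - rp) - (pn - rn))
        total += log_sigmoid(margin)
    return total / len(policy_logp_pos)
```

When the policy equals the reference model, both log-ratios vanish and every term equals log(0.5); raising the policy's log-probability of the positive response relative to the reference increases the objective.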

Abstract

Despite recent successes, Large Vision Language Models (LVLMs) are prone to hallucinating details like objects and their properties or relations, limiting their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities, and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We applied CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LLaVA-1.5, in all cases observing significant improvements in terms of hallucination reduction over baseline models. We also observe better performance for zero-shot classification, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.
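The rank-then-filter step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the CLIP image-text similarities are taken as precomputed floats, and the single score-gap rule (`min_gap`) is a hypothetical stand-in for the paper's rule-based filtering, whose actual rules are not given in this section.

```python
from typing import List, Optional, Tuple


def select_pair(
    captions: List[str],
    scores: List[float],
    min_gap: float = 0.05,  # hypothetical threshold, not from the paper
) -> Optional[Tuple[str, str]]:
    """Return (positive, negative) captions for one image, or None.

    `scores[i]` is assumed to be the CLIP image-text similarity of
    `captions[i]` with the image. The highest-scoring caption becomes the
    positive and the lowest-scoring one the negative; the pair is dropped
    when the two scores are too close to express a clear preference.
    """
    if len(captions) != len(scores) or len(captions) < 2:
        return None
    # Rank candidates from most to least similar to the image.
    ranked = sorted(zip(scores, captions), key=lambda t: t[0], reverse=True)
    best_score, best = ranked[0]
    worst_score, worst = ranked[-1]
    if best_score - worst_score < min_gap:
        return None
    return best, worst
```

Each returned pair would supply one $(y_i^+, y_i^-)$ for the image and prompt $x_i$ in DPO training.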

Paper Structure (18 sections, 1 equation, 8 figures, 6 tables)

Figures (8)

  • Figure 1: An example of the injected hallucinations. Given an image and its caption, we prompt GPT-4 to generate 3 types of hallucination: existence, attributes, and relation.
  • Figure 2: In dark blue, we show the number of hallucinated captions per type, out of 1K captions, to which LLaVA-1.5 7B assigns a higher likelihood than to the original caption. Light blue shows the portion of these samples corrected by CLIP.
  • Figure 3: CLIP-DPO. Starting from the initial SFT data pool and a set of prompts, an LVLM generates captions. These captions are first ranked using a CLIP model, then filtered to identify the most suitable positive and negative pairs for DPO-based optimization.
  • Figure 4: The distribution of CLIP image-text scores per category.
  • Figure 5: Examples of generic captions generated by MobileVLM-v2 1.7B and MobileVLM-v2 3B. For the captions generated by MobileVLM-v2 3B, we also show the question and the positive and negative answers generated using Mistral 7B.
  • ...and 3 more figures