CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
TL;DR
LVLMs suffer from hallucinations that limit real-world deployment. CLIP-DPO tackles this by ranking self-generated captions with a pre-trained CLIP model and applying robust rule-based filtering to form positive/negative pairs for Direct Preference Optimization (DPO), without paid APIs or external LVLMs. The method is evaluated on MobileVLM-v2 and LLaVA-1.5 7B, showing significant hallucination reduction and improved zero-shot grounding while preserving performance on standard LVLM benchmarks; it also outperforms HA-DPO and large-data baselines on AMBER. The approach is lightweight and scalable, and offers practical grounding improvements for LVLMs in real-world settings, with potential extensions to other VL benchmarks and scorer choices. Training uses the DPO objective over the dataset $\mathcal{D}$ of prompts $x_i$ with positive and negative responses $y_i^+$ and $y_i^-$: $\max_{\pi_\theta} \mathbb{E}_{(x_i,y_i^+,y_i^-)\sim\mathcal{D}} \log \sigma\left(\beta \log \frac{\pi_{\theta}(y_i^+ \mid x_i)}{\pi_{\mathrm{ref}}(y_i^+ \mid x_i)} - \beta \log \frac{\pi_{\theta}(y_i^- \mid x_i)}{\pi_{\mathrm{ref}}(y_i^- \mid x_i)}\right)$, where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the sigmoid function, and $\beta$ is a scaling hyperparameter.
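To make the objective above concrete, the following is a minimal sketch of the per-example loss (the negated term inside the expectation), assuming the summed sequence log-probabilities under the trained and reference models are already available. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not values taken from the paper.

```python
import math

def dpo_loss(policy_pos, policy_neg, ref_pos, ref_neg, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    policy_pos / policy_neg: log pi_theta(y+ | x) and log pi_theta(y- | x)
    ref_pos / ref_neg:       log pi_ref(y+ | x) and log pi_ref(y- | x)
    beta:                    scaling hyperparameter (0.1 is an assumed default)
    """
    # beta * (log-ratio of the positive minus log-ratio of the negative)
    margin = beta * ((policy_pos - ref_pos) - (policy_neg - ref_neg))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), evaluated stably
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the trained model equals the reference, the margin is zero and the loss is $\log 2$; the loss falls below that value as the trained model raises the likelihood of $y_i^+$ relative to $y_i^-$ compared with the reference.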
Abstract
Despite recent successes, Large Vision Language Models (LVLMs) are prone to hallucinating details such as objects and their properties or relations, which limits their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, rank them by their CLIP image-text similarities, and filter them using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We apply CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LLaVA-1.5, in all cases observing significant improvements in hallucination reduction over the baseline models. We also observe better zero-shot classification performance, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.
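The ranking and filtering step described above can be sketched as follows. This is an illustrative outline only: it assumes image and caption embeddings have already been computed by a CLIP-style model and are passed in as plain lists of floats, and the two filter rules (`min_words` and `min_gap`) are hypothetical stand-ins, since the abstract does not specify the actual rules.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def build_pair(image_emb, candidates, min_gap=0.05, min_words=3):
    """Pick a (chosen, rejected) caption pair for one image.

    candidates: list of (caption, text_embedding) tuples generated by the LVLM.
    min_gap, min_words: hypothetical filter thresholds, not from the paper.
    Returns None when no pair survives the filters.
    """
    # Hypothetical rule 1: drop captions that are too short to be informative.
    scored = [
        (cosine(image_emb, emb), caption)
        for caption, emb in candidates
        if len(caption.split()) >= min_words
    ]
    if len(scored) < 2:
        return None
    # Rank by image-text similarity, highest first.
    scored.sort(key=lambda item: item[0], reverse=True)
    best_score, best = scored[0]
    worst_score, worst = scored[-1]
    # Hypothetical rule 2: require a clear similarity gap between the two.
    if best_score - worst_score < min_gap:
        return None
    return {"chosen": best, "rejected": worst}
```

Each returned dictionary corresponds to one $(y_i^+, y_i^-)$ pair for a given image and prompt; images whose candidates are too similar in score yield no pair and are skipped.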
