Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, Dongsheng Li
TL;DR
This work identifies a critical limitation of Direct Preference Optimization (DPO) when used to mitigate hallucinations in Large Vision-Language Models (LVLMs): DPO depends on on-policy data, and learning from off-policy preferred responses is ineffective because of large reverse KL-divergence penalties. To address this, the authors propose On-Policy Alignment (OPA)-DPO, a four-step framework that uses expert revisions to correct hallucinations and then aligns both the original and the revised responses on-policy via LoRA-SFT, producing a suitable starting policy for DPO training. With only 4.8k training samples, the method achieves state-of-the-art hallucination reduction, outperforming previous approaches trained on significantly more data on benchmarks such as AMBER and Object-Hal, with robust gains on both 7B and 13B LVLMs. The approach combines three key components (Language Corrections, Image Focus Mechanism, and Anchored Preference) within a unified OPA-DPO loss. It is supported by comprehensive ablations, case studies, and comparisons to RLHF/RLAIF baselines, and it has practical implications for data-efficient, reliable multimodal instruction-following models.
Abstract
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It learns directly from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, the different data construction methods used in existing works lead to notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data is on-policy with respect to the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate these problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k samples, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B of 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples. Our implementation is available at https://github.com/zhyang2226/OPA-DPO.
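
For readers unfamiliar with the objective discussed above, the following is a minimal sketch of the standard per-pair DPO loss, computed from sequence log-probabilities. It is not the OPA-DPO loss, which the paper describes as combining further components. The function name, argument names, and the default value of beta are illustrative assumptions and are not taken from the authors' repository.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference policy.
    beta (assumed default 0.1) scales the implicit KL regularization.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Scaled preference margin between chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Because the loss depends only on log-ratios against the reference policy, it equals log 2 whenever the policy and the reference coincide, regardless of how likely each response is under the reference. The abstract's point concerns what happens afterwards: when a preferred response has low probability under the reference policy, the paper's analysis suggests that the KL-divergence term hinders the policy from learning it.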
