Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key

Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, Dongsheng Li

TL;DR

This work identifies a critical limitation of Direct Preference Optimization (DPO) for mitigating hallucinations in Large Vision-Language Models (LVLMs): DPO depends on on-policy data, and off-policy preferred responses are hard to learn because they incur large reverse KL-divergence penalties. To address this, the authors propose On-Policy Alignment (OPA)-DPO, a four-step framework that uses expert revisions to correct hallucinations and then aligns both the original and revised responses on-policy via LoRA-SFT, producing a suitable starting policy for DPO training. With only 4.8k training samples, the method achieves state-of-the-art hallucination reduction on benchmarks such as AMBER and Object-Hal, outperforming previous approaches trained on significantly more data, with gains reported for both 7B and 13B LVLMs. The approach combines three components (Language Corrections, Image Focus Mechanism, and Anchored Preference) within a unified OPA-DPO loss, and is supported by ablations, case studies, and comparisons to RLHF/RLAIF baselines.

Abstract

Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It learns directly from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works lead to notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data is on-policy with respect to the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate these problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k training samples, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples. Our implementation is available at https://github.com/zhyang2226/OPA-DPO.
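For readers unfamiliar with the objective the abstract builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is the generic formulation, not the OPA-DPO loss described in the paper. The function names, the sequence-level log-probability inputs, and the value `beta=0.1` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float,
             policy_logp_rejected: float,
             ref_logp_chosen: float,
             ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence-level log-probabilities of each response under the
    trained policy and the frozen reference policy. beta=0.1 is an assumed
    value for illustration, not a setting taken from the paper.
    """
    # Implicit rewards: scaled log-ratio between policy and reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-(chosen_reward - rejected_reward))


# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-42.0, -35.0, -42.0, -35.0)

# After the policy raises the chosen response and lowers the rejected one.
loss_after = dpo_loss(-40.0, -37.0, -42.0, -35.0)
```

At initialization the loss equals log 2 no matter how unlikely the chosen response is under the reference policy, so this starting value alone does not reveal whether a preferred response is off-policy. The abstract attributes the difficulty to the KL-divergence between the updated policy and the reference policy.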

Paper Structure (39 sections, 10 equations, 14 figures, 9 tables, 1 algorithm)

Figures (14)

  • Figure 1: (a) OPA-DPO motivation: Naive adoption of DPO struggles to learn off-policy preferred responses due to the substantial reverse KL-divergence constraint (induced by unmatched supports). Our OPA operation aligns these responses on-policy, enabling effective learning with subsequent DPO. (b) Data scale vs. performance: We present the AMBER hallucination rates for various DPO-based algorithms together with their training data volume. OPA-DPO (star markers) achieves SOTA performance with a minimal amount of data. (c) Impact of OPA: Using LLaVA-1.5-13B with 4.8k samples, we evaluate the performance of DPO with and without OPA operations. Including OPA significantly enhances performance compared to DPO alone.
  • Figure 2: We categorize existing DPO-based algorithms for addressing hallucination issues in LVLMs into 3 classes: (1) Hallucination Injection (POVID [zhou2024povid] and HALVA [sarkar2024halva]). The ground-truth response is preferred, while the rejected response contains injected hallucinations. Since the errors do not originate from the model itself, the policy is unlikely to benefit from training. (2) Hallucination Recognition (RLHF-V [yu2024rlhfv], HA-DPO [zhao2023hadpo] and HSA-DPO [xiao2024hsadpo]). The model generates responses, after which experts (AI or human) identify errors and make revisions. The off-policy nature of the revised responses makes them challenging to learn effectively. (3) Self Evolution (RLAIF-V [yu2024rlaif]). Both preferred and rejected responses are generated by the initial policy. A superior model assesses hallucinations and prefers the response with fewer errors. However, hallucinations may exist in both responses, which reduces learning efficiency.
  • Figure 3: Our proposed OPA-DPO comprises four essential steps: ① Collect responses from the original policy based on the images and corresponding prompts. ② Utilize GPT-4V to correct any hallucinations in the generated responses with minimal modifications. ③ Conduct LoRA-SFT on the ground-truth (GT) responses and revised responses. ④ Initiate OPA-DPO training from the policy obtained in step ③.
  • Figure 4: Distribution of response-averaged log probabilities for 200 significantly revised responses across different models.
  • Figure 5: Impact of data amount on hallucination-rate metrics.
  • ...and 9 more figures
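
The four steps in the Figure 3 caption can be sketched as a small orchestration function. This is a minimal sketch under stated assumptions, not the authors' implementation: `Sample`, `opa_dpo_pipeline`, and the four callables (`generate`, `revise`, `lora_sft`, `dpo_train`) are hypothetical names introduced here. Pairing each revised response (preferred) against the original response (rejected), and skipping pairs the expert left unchanged, is also an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Sample:
    image: str        # path or identifier of the image
    prompt: str       # instruction given with the image
    gt_response: str  # ground-truth (GT) response


# (image, prompt, response) triple used as an SFT target.
SftTarget = Tuple[str, str, str]
# (image, prompt, chosen, rejected) tuple used as a preference pair.
PreferencePair = Tuple[str, str, str, str]


def opa_dpo_pipeline(
    samples: List[Sample],
    generate: Callable[[str, str], str],         # hypothetical: original policy
    revise: Callable[[str, str, str], str],      # hypothetical: expert reviser
    lora_sft: Callable[[List[SftTarget]], Any],  # hypothetical: returns a policy
    dpo_train: Callable[[Any, List[PreferencePair]], Any],
) -> Any:
    # Step 1: collect responses from the original policy.
    originals = [generate(s.image, s.prompt) for s in samples]

    # Step 2: have the expert correct hallucinations with minimal edits.
    revised = [revise(s.image, s.prompt, r) for s, r in zip(samples, originals)]

    # Step 3: LoRA-SFT on the GT responses and the revised responses.
    sft_targets: List[SftTarget] = [
        (s.image, s.prompt, s.gt_response) for s in samples
    ] + [(s.image, s.prompt, r) for s, r in zip(samples, revised)]
    opa_policy = lora_sft(sft_targets)

    # Step 4: start preference training from the step-3 policy.
    # Assumption: revised is preferred over original; unchanged pairs skipped.
    pairs: List[PreferencePair] = [
        (s.image, s.prompt, rev, orig)
        for s, orig, rev in zip(samples, originals, revised)
        if rev != orig
    ]
    return dpo_train(opa_policy, pairs)
```

Passing the model-specific pieces in as callables keeps the sketch independent of any particular training library. In practice each callable would wrap the corresponding generation, revision, or training routine.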