Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation

Chengzhi Yu, Yifan Xu, Yifan Chen, Wenyi Zhang

TL;DR

The paper tackles hallucination in large vision-language models (LVLMs) by demonstrating that on-policy data substantially outperforms off-policy data for preference alignment. It introduces a pipeline that yields hallucination-free chosen samples and a robust iterative direct preference optimization (DPO) algorithm whose dynamic sample weighting, based on the Rao-Kupper model, focuses learning on informative samples. In experiments on multiple LVLM benchmarks with LLaVA variants, the approach markedly reduces hallucination and, in some cases, surpasses GPT-4V. The work advances principled on-policy data construction and stable alignment optimization for reliable multimodal grounding.
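
For context, the Rao-Kupper model extends the Bradley-Terry pairwise comparison model with an explicit tie outcome. A standard statement, for positive strengths $\pi_i, \pi_j$ and a tie parameter $\nu \ge 1$, is sketched below. Writing the tie parameter as $\nu$ and the way the paper converts these probabilities into sample weights are assumptions that this summary does not confirm.

$$
P(i \succ j) = \frac{\pi_i}{\pi_i + \nu\,\pi_j}, \qquad
P(j \succ i) = \frac{\pi_j}{\pi_j + \nu\,\pi_i}, \qquad
P(i \sim j) = \frac{(\nu^2 - 1)\,\pi_i \pi_j}{(\pi_i + \nu\,\pi_j)(\pi_j + \nu\,\pi_i)}
$$

The three probabilities sum to 1, and with $\nu = 1$ the tie probability vanishes, so the model reduces to Bradley-Terry.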

Abstract

Recently, large vision-language models (LVLMs) have emerged as a promising approach for multimodal tasks. However, principled hallucination mitigation remains a critical challenge. In this work, we first analyze the data generation process in LVLM hallucination mitigation and affirm that on-policy data significantly outperforms off-policy data, which calls for efficient and reliable preference annotation of on-policy data. We then point out that existing annotation methods introduce additional hallucination into training samples, which may reinforce the model's hallucination patterns. To address this problem, we propose training a hallucination classifier that gives binary annotations, which guarantees clean chosen samples for the subsequent alignment. To further harness the power of on-policy data, we design a robust iterative direct preference optimization (DPO) algorithm that adopts a dynamic sample reweighting scheme. We conduct comprehensive experiments on three benchmarks, comparing against 8 state-of-the-art baselines. In particular, our approach reduces the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8% and the average hallucination rate on Object HalBench by 79.5%; more significantly, our method fully taps into the potential of open-source models, enabling LLaVA-1.5-13B to even surpass the performance of GPT-4V.
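
To make the idea of a sample-reweighted DPO objective concrete, here is a minimal sketch of the standard DPO loss with a per-sample weight attached to each preference pair. The function name `weighted_dpo_loss`, the default `beta=0.1`, and the normalization by the sum of weights are illustrative assumptions; the abstract does not specify how the paper computes or applies its dynamic weights, so the weights are simply passed in.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(
    policy_chosen_logp: Sequence[float],
    policy_rejected_logp: Sequence[float],
    ref_chosen_logp: Sequence[float],
    ref_rejected_logp: Sequence[float],
    weights: Sequence[float],
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Weighted mean of per-pair DPO losses.

    Each input holds one sequence-level log-probability per preference pair.
    The weights are placeholders for whatever reweighting scheme is used.
    """
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")

    total = 0.0
    for pc, pr, rc, rr, w in zip(
        policy_chosen_logp, policy_rejected_logp,
        ref_chosen_logp, ref_rejected_logp, weights,
    ):
        # Implicit reward margin between chosen and rejected responses.
        margin = beta * ((pc - rc) - (pr - rr))
        total += -w * log_sigmoid(margin)
    # Assumption: normalize by total weight rather than by batch size.
    return total / weight_sum
```

When the policy equals the reference model, every margin is zero and the loss equals log 2 regardless of the weights; the loss falls as the policy raises the chosen response's log-probability relative to the rejected one.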

Paper Structure

This paper contains 33 sections, 31 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: An illustrative example of generation probability distribution after on/off-policy training. The correct token "2" is shown in green and the hallucinated token with the highest probability is shown in red. Panel (b) displays the hallucination mode of the reference model. Across the two training paradigms, off-policy training (panel (c), top) fails to overturn this dominant hallucination pattern. In contrast, on-policy training (panel (c), bottom) substantially increases the probability of the correct answer and effectively suppresses the dominant hallucinated token. The detailed analysis of this phenomenon is presented in \ref{subsec:on-policy}.
  • Figure 2: Overview of our framework. Our method consists of three steps: (1) Rollout: generating $N$ responses per image-prompt pair to form ⟨image, prompt, GT answer, response⟩ tuples; (2) Hallucination Judgement: selecting chosen and rejected samples based on hallucination probabilities from a trained classifier; (3) Sample-Weighted Iterative Alignment: fine-tuning the model using the preference dataset. These steps are repeated iteratively until the model converges.
  • Figure 3: Hallucination-Free Chosen Sample Selection Process. For each prompt, we generate $N$ responses and use a hallucination classifier to evaluate each response. Prompts where all responses are either entirely hallucinated or entirely hallucination-free are excluded from the training set.
  • Figure 4: Qualitative Results from MMHalBench. We highlight the correct and incorrect parts of the responses from different models using bold green and red text, respectively.
  • Figure 5: Ablation study on parameter $\nu$.
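
The selection step described in the Figure 2 and Figure 3 captions can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `build_preference_pairs`, the 0.5 decision threshold, and the choice to pair the least-hallucinated response with the most-hallucinated one are all hypothetical. The captions only state that chosen and rejected samples come from classifier hallucination probabilities and that prompts whose responses are all hallucinated or all hallucination-free are excluded.

```python
from typing import Dict, List, Tuple

# (response text, hallucination probability from a classifier)
ScoredResponse = Tuple[str, float]


def build_preference_pairs(
    rollouts: Dict[str, List[ScoredResponse]],
    threshold: float = 0.5,  # assumed decision threshold
) -> Dict[str, Tuple[str, str]]:
    """Map each usable prompt id to a (chosen, rejected) response pair."""
    pairs: Dict[str, Tuple[str, str]] = {}
    for prompt_id, scored in rollouts.items():
        clean = [(r, p) for r, p in scored if p < threshold]
        hallucinated = [(r, p) for r, p in scored if p >= threshold]

        # Skip prompts where every response falls on the same side:
        # no contrast is available to form a preference pair.
        if not clean or not hallucinated:
            continue

        # Assumption: take the most confident example from each side.
        chosen = min(clean, key=lambda rp: rp[1])[0]
        rejected = max(hallucinated, key=lambda rp: rp[1])[0]
        pairs[prompt_id] = (chosen, rejected)
    return pairs


if __name__ == "__main__":
    demo = {
        "mixed": [("two cats", 0.1), ("three dogs", 0.9)],
        "all_clean": [("two cats", 0.1), ("a pair of cats", 0.2)],
        "all_hallucinated": [("three dogs", 0.8), ("a horse", 0.9)],
    }
    print(build_preference_pairs(demo))
```

In an iterative setup, the resulting pairs would be used for one round of alignment, after which new responses are sampled from the updated model and the selection is repeated.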

Theorems & Definitions (2)

  • Remark 4.1
  • proof