Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation

Chengzhi Yu; Yifan Xu; Yifan Chen; Wenyi Zhang

TL;DR

The paper tackles hallucination in large vision-language models (LVLMs) by showing that on-policy data substantially outperforms off-policy data for preference alignment. It introduces a pipeline that yields hallucination-free chosen samples and a robust iterative direct preference optimization (DPO) algorithm whose dynamic sample weighting, based on the Rao-Kupper model, focuses learning on informative samples. In experiments on multiple LVLM benchmarks with LLaVA variants, the approach cuts the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8% and, in some cases, enables LLaVA-1.5-13B to surpass GPT-4V. The work advances principled on-policy data construction and stable alignment optimization for reliable multimodal grounding.
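The summary names the DPO objective and the Rao-Kupper model without giving formulas. As a rough sketch, the snippet below computes the standard per-sample DPO loss from an implicit reward margin, plus the Rao-Kupper win/tie/loss probabilities (a Bradley-Terry extension with a tie parameter θ ≥ 1). How the paper turns these probabilities into sample weights is not stated here, so `illustrative_weight` (which down-weights pairs that look like ties) is purely an assumption for illustration, as are all function names and the default values of `beta` and `theta`.

```python
import math

def dpo_margin(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    """Implicit reward margin in DPO: beta * (chosen log-ratio - rejected log-ratio)."""
    return beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))

def dpo_loss(margin):
    """Standard per-sample DPO loss, -log(sigmoid(margin)), computed stably."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def rao_kupper(margin, theta=1.5):
    """Rao-Kupper (win, tie, loss) probabilities for a reward margin.

    With strengths pi = exp(reward), P(win) = pi_c / (pi_c + theta * pi_r),
    which depends only on the margin r_c - r_r. theta = 1 recovers Bradley-Terry.
    """
    p_win = 1.0 / (1.0 + theta * math.exp(-margin))
    p_loss = 1.0 / (1.0 + theta * math.exp(margin))
    return p_win, 1.0 - p_win - p_loss, p_loss

def illustrative_weight(margin, theta=1.5):
    """HYPOTHETICAL weighting rule (not from the paper): 1 - P(tie)."""
    _, p_tie, _ = rao_kupper(margin, theta)
    return 1.0 - p_tie

def weighted_dpo_loss(margins, theta=1.5):
    """Weighted average of per-sample DPO losses under the hypothetical rule."""
    weights = [illustrative_weight(m, theta) for m in margins]
    losses = [dpo_loss(m) for m in margins]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

In this sketch, `theta` controls how wide the "tie" region is: the larger it is, the more pairs with small reward margins are treated as ambiguous.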

Abstract

Recently, large vision-language models (LVLMs) have emerged as a promising approach for multimodal tasks. However, principled hallucination mitigation remains a critical challenge. In this work, we first analyze the data generation process in LVLM hallucination mitigation and affirm that on-policy data significantly outperforms off-policy data, which calls for efficient and reliable preference annotation of on-policy data. We then point out that existing annotation methods introduce additional hallucination into training samples, which may reinforce the model's hallucination patterns. To address this problem, we propose training a hallucination classifier that gives binary annotations, which guarantees clean chosen samples for the subsequent alignment. To further harness the power of on-policy data, we design a robust iterative direct preference optimization (DPO) algorithm that adopts a dynamic sample reweighting scheme. We conduct comprehensive experiments on three benchmarks, comparing against 8 state-of-the-art baselines. In particular, our approach reduces the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8% and the average hallucination rate on Object HalBench by 79.5%; more significantly, our method fully taps into the potential of open-source models, enabling LLaVA-1.5-13B to even surpass the performance of GPT-4V.
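The data pipeline described above (sample responses from the current policy, label each with a binary hallucination classifier, and keep only clean responses as chosen samples) might be sketched as follows. Everything here is a hypothetical stand-in: `generate` and `is_hallucinated` are placeholder callables rather than the authors' models, and using hallucinated responses as the rejected side of each pair is an assumption.

```python
from itertools import product
from typing import Callable, Dict, List

def build_pairs(
    prompts: List[str],
    generate: Callable[[str], str],               # hypothetical: samples from the current policy
    is_hallucinated: Callable[[str, str], bool],  # hypothetical: binary classifier verdict
    n_samples: int = 4,
) -> List[Dict[str, str]]:
    """Build preference pairs whose chosen side was judged hallucination-free."""
    pairs = []
    for prompt in prompts:
        # On-policy: every candidate comes from the model being trained.
        responses = [generate(prompt) for _ in range(n_samples)]
        labels = [is_hallucinated(prompt, r) for r in responses]
        clean = [r for r, bad in zip(responses, labels) if not bad]
        dirty = [r for r, bad in zip(responses, labels) if bad]
        # Prompts without both a clean and a hallucinated response yield no pair.
        for chosen, rejected in product(clean, dirty):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

In an iterative scheme, a function like this would presumably be rerun each round with the updated policy as `generate`, so that the preference data stays on-policy.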
