PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning

Xinpeng Ding, Kui Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Xiaomeng Li

TL;DR

This work addresses video hallucination in video-language models by moving from offline preference data to online, video-centric supervision. The core contribution, PaMi-VDPO, combines VDPO with prompt-aware multi-instance learning: it constructs a candidate set of augmented rejected videos and selects the most prompt-relevant, semantically distinct clip for training, while down-weighting noisy samples. The approach needs neither pre-constructed preference data nor architectural changes, and it mitigates hallucination while maintaining or improving performance on general video benchmarks. With 10k SFT data, it improves the base model by 5.3% on VideoHallucer, surpassing GPT-4o, and generalizes across datasets and backbones.

Abstract

Direct Preference Optimization (DPO) helps reduce hallucinations in Video Multimodal Large Language Models (VLLMs), but its reliance on offline preference data limits adaptability and fails to capture true video-response misalignment. We propose Video Direct Preference Optimization (VDPO), an online preference learning framework that eliminates the need for preference annotation by leveraging video augmentations to generate rejected samples while keeping responses fixed. However, selecting effective augmentations is non-trivial, as some clips may be semantically identical to the original under specific prompts, leading to false rejections and disrupting alignment. To address this, we introduce Prompt-aware Multi-instance Learning VDPO (PaMi-VDPO), which selects augmentations based on prompt context. Instead of a single rejection, we construct a candidate set of augmented clips and apply a close-to-far selection strategy, first ensuring that all clips are semantically relevant and then prioritizing the most prompt-aware distinct clip. This allows the model to better capture meaningful visual differences, mitigating hallucinations while avoiding false rejections and improving alignment. PaMi-VDPO seamlessly integrates into existing VLLMs without additional parameters or GPT-4/human supervision. With only 10k SFT data, it improves the base model by 5.3% on VideoHallucer, surpassing GPT-4o, while maintaining stable performance on general video benchmarks.
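To make the idea of "rejected samples from video augmentations with the response kept fixed" concrete, the following is a minimal sketch, not the paper's objective (the VDPO equation is not reproduced in this summary). It assumes the standard DPO form, with the preference pair formed over videos rather than responses: the same response is scored once against the original clip and once against an augmented clip. The function name `vdpo_loss`, its arguments, and the default `beta` are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def vdpo_loss(logp_w: float, ref_logp_w: float,
              logp_l: float, ref_logp_l: float,
              beta: float = 0.1) -> float:
    """DPO-style loss over videos (assumed form, hypothetical names).

    logp_w / ref_logp_w: log-probability of the fixed response given the
        original clip, under the policy and the frozen reference model.
    logp_l / ref_logp_l: the same response scored against the augmented
        (rejected) clip.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # No preference yet: the loss equals log(2).
    print(vdpo_loss(0.0, 0.0, 0.0, 0.0))
    # The policy favors the original clip more than the reference does,
    # so the loss drops below log(2).
    print(vdpo_loss(-5.0, -6.0, -9.0, -6.0))
```

Under this assumed form, a false rejection (an augmented clip that still supports the same response) pushes the model to lower the probability of a correct response. That is the failure mode the prompt-aware selection is meant to avoid.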

Paper Structure

This paper contains 52 sections, 6 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Comparison of different methods. Left: Direct Preference Optimization (DPO) requires costly LLMs or human annotators to preconstruct preference data offline for different tasks, limiting its generalization. Moreover, its preference learning applies only to responses and ignores the video. Right: Our Prompt-aware Multi-instance learning Video DPO (PaMi-VDPO) overcomes these limitations by constructing a candidate rejected video set during training and automatically selecting the appropriate rejected video based on the prompt to perform video-enhanced preference learning.
  • Figure 2: Impact of Augmentations. RA: Random video augmentation. (a) The performance gap with the base model (LLaVA-OV-7B \cite{li2024llava}) and (b) the DPO loss curve show that RA leads to performance degradation on general benchmarks and unstable training (red box). Performance gains and degradations are highlighted in green and red, respectively. In contrast, our Prompt-aware Multi-instance learning Video DPO (PaMi-VDPO) achieves superior performance on both hallucination and general benchmarks while ensuring stable training.
  • Figure 3: (a) Video DPO (VDPO) Pipeline. VDPO applies augmentations to the original clip $\mathbf{V}^w$ to generate the rejected clip $\mathbf{V}^l$; the model is then optimized with the VDPO objective (Eq. \ref{e:vdpo}). (b) Performance Gap Across Augmentations. The x-axis represents different augmentation strategies for generating rejected clips, while the y-axis shows the performance gap relative to the baseline (LLaVA-OV-7B \cite{li2024llava}) across hallucination and general benchmarks (see Section \ref{sec:setting}). Gains and degradations are highlighted in green and red, respectively. High-similarity augmentations retain semantic closeness to the original clip, whereas low-similarity ones introduce significant differences (see more in Section \ref{sec:setting}). The choice of augmentation strategy has a substantial impact on VDPO performance, analyzed further in Section \ref{sec:findings}.
  • Figure 4: (a) VDPO loss curve. The solid line depicts the average loss, with the shaded area showing the variance. The VDPO loss trained with high-similarity augmentations exhibits greater variance, whereas low-similarity augmentations enable smoother convergence. (b) Prompt-aware augmentation. The impact of an augmentation varies by prompt, i.e., for some prompts the augmented rejected clip may have the same response as the chosen clip. We term such clips false-rejected.
  • Figure 5: Prompt-aware augmentation stabilizes VDPO. Comparative analysis of both the performance gap (Left) and the loss curve (Right) reveals that selecting appropriate augmentations based on the questions can help mitigate VDPO instability. Specifically, Shuffle-Clean and Shuffle-Mixed represent VDPO training with the Shuffle augmentation on Clean and Mixed data, respectively (see Section \ref{sec:findings} for details). Shuffle-Clean consistently reduces hallucination and maintains generalizability, whereas Shuffle-Mixed degrades both general-task performance and optimization stability.
  • ...and 5 more figures
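
The captions above describe building a candidate set of rejected clips (for example with the Shuffle augmentation) and choosing among them based on the prompt. The sketch below illustrates one way such a pipeline could look; it is not the paper's procedure. Only the Shuffle augmentation is named in this summary. The `reverse` augmentation, the per-candidate `distinct_scores`, the temperature schedule, and all function names are hypothetical. The annealed softmax is a stand-in for a close-to-far strategy: early in training every candidate contributes almost equally, and later the weight concentrates on the candidate that is most distinct under the prompt.

```python
import math
import random
from typing import Dict, List, Sequence


def build_candidates(frames: Sequence[int], seed: int = 0) -> Dict[str, List[int]]:
    """Build a small candidate set of rejected clips from frame indices."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)  # Shuffle augmentation (named in the captions)
    return {
        "shuffle": shuffled,
        "reverse": list(reversed(frames)),  # hypothetical extra augmentation
    }


def selection_weights(distinct_scores: Sequence[float], progress: float) -> List[float]:
    """Softmax over candidate scores with an annealed temperature.

    distinct_scores: hypothetical prompt-conditioned scores; higher means the
        augmented clip differs more from the original for this prompt.
    progress: training progress in [0, 1].
    """
    # Assumed schedule: high temperature (near-uniform) to low (peaked).
    temperature = 10.0 * (1.0 - progress) + 0.1 * progress
    top = max(distinct_scores)
    exps = [math.exp((s - top) / temperature) for s in distinct_scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    candidates = build_candidates(range(8))
    print(candidates)
    scores = [0.1, 2.0, 0.5]
    print(selection_weights(scores, progress=0.0))  # close to uniform
    print(selection_weights(scores, progress=1.0))  # concentrated on index 1
```

A weighted sum of per-candidate losses using these weights would down-weight candidates that look like false rejections (low distinctness under the prompt) without discarding them outright early in training.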