Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization

Xintong Li, Chuhan Wang, Junda Wu, Rohan Surana, Tong Yu, Julian McAuley, Jingbo Shang

TL;DR

This work tackles hallucinations in multimodal direct preference optimization by introducing MISP-DPO, a framework that learns from semantically diverse multi-negative samples. It identifies negatives through a sparse autoencoder in CLIP space and integrates them using a Plackett-Luce ranking objective with importance sampling for efficiency. Across five benchmarks and multiple model backbones, MISP-DPO yields superior multimodal alignment and substantial hallucination reductions, outperforming prior single-negative DPO and related methods. The approach offers scalable, interpretable supervision by surfacing latent semantic deviations and leveraging efficient sampling to improve vision-language alignment in practice.

Abstract

Direct Preference Optimization (DPO) has recently been extended from text-only models to vision-language models. However, existing methods rely on oversimplified pairwise comparisons, generating a single negative image via basic perturbations or similarity-based retrieval. Such comparisons fail to capture the complex nature of multimodal preferences, inducing optimization bias and hallucinations. To address this issue, we propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO via the Plackett-Luce model. Our method embeds prompts and candidate images in CLIP (Contrastive Language-Image Pretraining) space and applies a sparse autoencoder to decompose semantic deviations into interpretable factors. Negative samples are selected based on reconstruction difficulty, semantic deviation from the positive, and mutual diversity, yielding broader and more informative supervision. To handle multi-negative comparisons, we adopt a Plackett-Luce objective and introduce an importance sampling strategy that improves training efficiency. Experiments across five diverse benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods, validating the effectiveness of semantic-aware, multi-negative sampling in preference-based learning.
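
To make the multi-negative comparison concrete, the following is a minimal sketch of a top-1 Plackett-Luce loss in which the positive image should outrank every negative image for the same preferred response. It is an illustration under stated assumptions, not the paper's implementation: the function name `multi_negative_pl_loss`, its arguments, the DPO-style implicit reward `beta * (log pi_theta - log pi_ref)`, and the default `beta` are all hypothetical choices made for this example.

```python
import math


def multi_negative_pl_loss(logp_pos, logp_negs, ref_logp_pos, ref_logp_negs, beta=0.1):
    """Top-1 Plackett-Luce loss for one positive image and K negative images.

    Every log-probability is that of the SAME preferred response, conditioned
    on the prompt plus either the positive image or one negative image.
    (Assumed setup for illustration; not taken from the paper's code.)
    """
    # DPO-style implicit reward of the positive image (assumption).
    r_pos = beta * (logp_pos - ref_logp_pos)
    # Margin of each negative over the positive: r_neg_i - r_pos.
    margins = [beta * (lp - ref) - r_pos for lp, ref in zip(logp_negs, ref_logp_negs)]

    # loss = -log softmax(positive) = log(1 + sum_i exp(margin_i)),
    # computed with the usual max-shift for numerical stability.
    z = [0.0] + margins
    m = max(z)
    loss = m + math.log(sum(math.exp(v - m) for v in z))

    # Per-negative softmax weight exp(margin_i) / (1 + sum_j exp(margin_j)).
    weights = [math.exp(v - loss) for v in margins]
    return loss, weights


if __name__ == "__main__":
    # Made-up log-probabilities, purely for demonstration.
    loss, weights = multi_negative_pl_loss(
        logp_pos=-12.0,
        logp_negs=[-15.0, -13.5, -20.0],
        ref_logp_pos=-13.0,
        ref_logp_negs=[-14.0, -13.0, -19.0],
    )
    print("loss:", loss)
    print("weights:", weights)
```

In this sketch, a negative whose reward is close to or above that of the positive receives a larger weight, so it contributes more to the gradient. With a single negative the loss reduces to the familiar pairwise logistic form.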

Paper Structure

This paper contains 19 sections, 1 theorem, 17 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Lemma 4.1

Define the preference advantage of each negative image and the preference distribution as […]. Then the gradient of Eq. (eq:sdpo-image-loss) decomposes as […], where $\Delta_\theta(m_n^i, m_p \mid x, y_p) = \nabla_\theta \log \pi_\theta(y_p \mid x, m_n^i) - \nabla_\theta \log \pi_\theta(y_p \mid x, m_p)$.
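
As a hedged illustration of how a decomposition of this kind can arise, assume a top-1 Plackett-Luce loss over one positive image $m_p$ and $K$ negative images $m_n^i$, with a DPO-style implicit reward $r_\theta(m) = \beta \log \frac{\pi_\theta(y_p \mid x, m)}{\pi_{\mathrm{ref}}(y_p \mid x, m)}$. The loss form and the symbols $r_\theta$, $a_i$, $w_i$, and $K$ are assumptions introduced for this sketch; they are not the lemma's own statement or notation, which is not reproduced here.

```latex
\begin{aligned}
\mathcal{L}(\theta)
  &= -\log \frac{\exp r_\theta(m_p)}
                {\exp r_\theta(m_p) + \sum_{i=1}^{K} \exp r_\theta(m_n^i)}
   = \log\Bigl(1 + \sum_{i=1}^{K} \exp a_i\Bigr),
  \qquad a_i = r_\theta(m_n^i) - r_\theta(m_p), \\[4pt]
\nabla_\theta \mathcal{L}(\theta)
  &= \sum_{i=1}^{K} w_i \, \nabla_\theta a_i
   = \beta \sum_{i=1}^{K} w_i \, \Delta_\theta(m_n^i, m_p \mid x, y_p),
  \qquad w_i = \frac{\exp a_i}{1 + \sum_{j=1}^{K} \exp a_j}.
\end{aligned}
```

The second line uses the fact that the reference policy does not depend on $\theta$, so $\nabla_\theta a_i = \beta\,\Delta_\theta(m_n^i, m_p \mid x, y_p)$. Under these assumptions the gradient is a weighted sum over negatives, which is the kind of structure that a sampling-based estimator over negatives can exploit.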

Figures (5)

  • Figure 1: Overview of the MISP-DPO framework, which integrates CLIP encoding and sparse autoencoder–guided selection to identify diverse negatives for multi-negative preference optimization.
  • Figure 2: t-SNE visualizations and benchmark results for negative sampling. Left: importance-sampled negatives selected using SAE scoring exhibit broad semantic dispersion across three selections. Middle: randomly sampled negatives form tight, low-diversity clusters. Right: performance across benchmarks with different numbers of negatives selected by MISP-DPO.
  • Figure 3: Negative image retrieval using our MISP-DPO method. Each row shows a chosen image and three negatives; highlighted phrases in red, blue and green mark mismatches with negatives.
  • Figure 4: Comparison of Score and Hallucination Rate across different $\beta$ values on MMHalBench.
  • Figure 5: Comparison of responses from LLaVA models trained with different methods: pretrained, DPO, CHiP, and MISP-DPO. Blue text indicates faithful descriptions; red text marks hallucinated or unsupported content.

Theorems & Definitions (1)

  • Lemma 4.1: Gradient Decomposition