Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Elmira Amirloo, Jean-Philippe Fauconnier, Christoph Roesmann, Christian Kerl, Rinu Boney, Yusu Qian, Zirui Wang, Afshin Dehghan, Yinfei Yang, Zhe Gan, Peter Grasch

TL;DR

Understanding Alignment in Multimodal LLMs addresses image-grounded hallucinations by dissecting offline and online alignment strategies, scrutinizing multimodal preference data, and introducing Bias-Driven Hallucination Sampling (BDHS) as an annotation-free data generation method. The study shows that hybrid offline-online approaches can yield robust gains, and BDHS consistently matches or surpasses larger, more resource-intensive datasets across multiple benchmarks. It demonstrates that careful dataset construction and alignment objective choices are critical for improving visual faithfulness without sacrificing helpfulness. The work lays groundwork for cost-effective, scalable MLLM alignment and points to future directions in benchmark design and reward modeling.

Abstract

Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by producing responses that are inconsistent with the image content. A primary objective of alignment for MLLMs is to encourage these models to align responses more closely with image information. Recently, multiple works have introduced preference datasets for MLLMs and examined different alignment methods, including Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). However, due to variations in datasets, base model types, and alignment methods, it remains unclear which specific elements contribute most significantly to the reported improvements in these works. In this paper, we independently analyze each aspect of preference alignment in MLLMs. We start by categorizing the alignment algorithms into two groups, offline (such as DPO), and online (such as online-DPO), and show that combining offline and online methods can improve the performance of the model in certain scenarios. We review a variety of published multimodal preference datasets and discuss how the details of their construction impact model performance. Based on these insights, we introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS) that needs neither additional annotation nor external models, and show that it can achieve competitive performance to previously published alignment work for multimodal models across a range of benchmarks.
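
The abstract names DPO as the representative offline method. As a rough illustration, the standard DPO loss for a single preference pair can be sketched as below. This is a minimal sketch, not the paper's implementation: it assumes the summed sequence log-probabilities under the policy and the reference model are already available, and the function name, argument names, and default `beta` are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All inputs are sequence log-probabilities (floats). The names and the
    default beta are illustrative assumptions, not values from the paper.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # Scaled margin between the two implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero.
loss_equal = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy favors the chosen response more than the reference does.
loss_better = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

In an offline setting the pairs would come from a fixed preference dataset; in an online setting they would be sampled from the current policy and ranked before the same loss is applied.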

Paper Structure

This paper contains 42 sections, 9 equations, 11 figures, 15 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of the alignment objective for multimodal LLMs. The alignment problem is formulated as reward maximization w.r.t. the parameters of the policy $\pi_\theta$. A regularization term provides a positive reward for staying close to a reference policy $\pi_\text{ref}$. For DAP methods, the reward maximization problem is transformed to a supervised learning problem by using a closed-form solution for preference pairs. Both online and offline methods draw from a prompt distribution $x \sim p$ that comprises text and image prompts. However, for the sampling distribution $(y^+, y^-) \sim \mu$, offline methods draw from a fixed preference dataset while online ones sample from the policy, $\pi_\theta$, and rank/score with an annotator or reward model.
  • Figure 2: Overview of the BDHS method including the optional iterative variant in gray. For each re-generation, both the image attention mask and the sentence split positions are resampled. The image is taken from the LLaVA Instruct dataset.
  • Figure 3: Example of generated responses for different hyperparameters and approaches. The image, prompt and SFT ground truth are taken from LLaVA Instruct. For guided generation, actual model completions are shown in bold face.
  • Figure 4: Upon analysis of the losses on POPE, we noticed close to 20% of cases where the ground truth was either incorrect or disputable. This example is from POPE, which sources images from COCO Lin_2014.
  • Figure 5: Examples illustrating instances where the CHAIR and Object HalBench benchmarks fail to detect hallucinations. Text highlighted in green identifies hallucinations successfully detected by the benchmarks. In contrast, text highlighted in red indicates examples where the benchmark failed to identify hallucinations. Orange indicates hallucinations that, though not targeted by these benchmarks, degrade response quality. The top example shows the benchmark proposed by Yu_2023 while the bottom example follows from Rohrbach_2018. Images are from COCO Lin_2014.
  • ...and 6 more figures
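
The objective described in the Figure 1 caption is commonly written as a KL-regularized reward maximization, with the DPO loss as its supervised counterpart on preference pairs. The equations below are the standard forms from the DPO literature, given as a sketch: the reward $r$, the coefficient $\beta$, and the exact notation are assumptions here and may differ from the paper's own equations.

```latex
% KL-regularized reward maximization over the policy parameters
\max_{\theta} \;
\mathbb{E}_{x \sim p} \Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  - \beta \, \mathbb{D}_{\mathrm{KL}} \big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_\text{ref}(\cdot \mid x) \big]
\Big]

% Supervised DPO loss on preference pairs (y^+, y^-) drawn from \mu
\mathcal{L}_{\mathrm{DPO}}(\theta) =
- \, \mathbb{E}_{x \sim p,\; (y^+, y^-) \sim \mu} \left[
  \log \sigma \left(
    \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_\text{ref}(y^+ \mid x)}
    - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_\text{ref}(y^- \mid x)}
  \right)
\right]
```

Under this reading, offline and online methods share the same loss and differ only in $\mu$: a fixed preference dataset in the offline case, and samples from $\pi_\theta$ that are ranked or scored by an annotator or reward model in the online case.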