DAMA: Data- and Model-aware Alignment of Multi-modal LLMs

Jinda Lu, Junkang Wu, Jinghan Li, Xiaojun Jia, Shuo Wang, YiFan Zhang, Junfeng Fang, Xiang Wang, Xiangnan He

TL;DR

DAMA addresses the problem that direct preference optimization (DPO) for multimodal LLM alignment responds unevenly to data hardness. It introduces data-aware and model-aware strategies that modulate learning via adaptive adjustments to the optimization signal, using CLIP-based hardness estimates and real-time reward gaps. Across five benchmarks, DAMA yields strong improvements in trustworthiness and effectiveness, with notable reductions in hallucinations (e.g., Object HalBench) and competitiveness against GPT-4V. The work advances robust, human-preference-aligned multimodal models and highlights practical pathways to reduce hallucinations in real-world settings.

Abstract

Direct Preference Optimization (DPO) has proven effective in aligning multi-modal large language models (MLLMs) with human preferences. However, existing methods respond unevenly to data of varying hardness: they tend to overfit easy-to-distinguish data while underfitting hard-to-distinguish data. In this paper, we propose Data- and Model-aware DPO (DAMA), which dynamically adjusts the optimization process in two ways: (1) a data-aware strategy that incorporates data hardness, and (2) a model-aware strategy that integrates real-time model responses. By combining the two strategies, DAMA enables the model to adapt effectively to data of varying hardness. Extensive experiments on five benchmarks demonstrate that DAMA not only significantly enhances trustworthiness but also improves effectiveness on general tasks. For instance, on Object-HalBench, our DAMA-7B reduces response-level and mentioned-level hallucination by 90.0% and 95.3%, respectively, surpassing the performance of GPT-4V.
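
For reference, the standard DPO objective that the abstract builds on can be written as below. This is the general form from the DPO literature, not a formula copied from this paper; the notation is an assumption, with $x$ standing for the prompt together with the image, $\pi_\theta$ for the model being trained, $\pi_{\mathrm{ref}}$ for the frozen reference model, and $\sigma$ for the logistic function.

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

The quantity $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is usually called the implicit reward of response $y$, and the difference between its values for $y_w$ and $y_l$ is the reward gap. The coefficient $\beta$ controls how strongly the model is tied to the reference, which is why adjusting $\beta$ per instance or per batch changes how aggressively each preference pair is fitted.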

Paper Structure

This paper contains 16 sections, 15 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: (1) Preference data (Prompt, Image, Preferred response $y_w$, Rejected response $y_l$) with different hardness: "easy-to-distinguish" data has a large gap between the image-text similarity scores of $y_l$ and $y_w$; "hard-to-distinguish" data has a small gap between the scores of $y_l$ and $y_w$. (2) Implicit reward across the optimization stage: the reward gap for "easy-to-distinguish" data increases significantly during optimization, while for "hard-to-distinguish" data, the gap remains low.
  • Figure 2: Overview of our data-aware preference optimization. For each preference instance: (1) We first break the preferred and rejected responses into sub-sentences by prompting a large language model (LLM); (2) Next, we estimate the similarity score between each sub-sentence and the given image using the CLIP classifier, and then take the difference between the scores of the preferred and rejected responses as the hardness of the data; (3) Finally, we incorporate the estimated hardness into the preference optimization process by modifying $\beta$ in Eq. \ref{equ:dpo}, allowing the model to adjust based on the data hardness.
  • Figure 3: Overview of our model-aware preference optimization. Given $N$ preference instances: (1) We first calculate the reward gap of each instance using the implicit reward model; (2) To ensure stable modeling, we filter out the outliers (i.e., instances with excessively high or low gaps) and then estimate the average gap; (3) To make the model aware of its current responsiveness, we integrate this estimate into the preference optimization process by modifying $\beta$ in Eq. \ref{equ:dpo}.
  • Figure 4: Experimental results of the combination strategies with the response-level non-hallucination rates.
  • Figure 5: Experimental results of the combination strategies with the mentioned-level non-hallucination rates.
  • ...and 2 more figures
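
The pipelines described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the CLIP similarities are taken as precomputed numbers, averaging over sub-sentences is assumed, the outlier filter is a simple symmetric trim, and `scaled_beta` is a hypothetical sigmoid mapping chosen only to show how a hardness or reward-gap signal could modulate $\beta$. The direction (larger gap gives larger $\beta$, hence a more conservative update on easy data) is likewise an assumption.

```python
import math
from statistics import mean


def hardness_gap(sims_preferred, sims_rejected):
    """Data-aware signal: gap between the mean image-text similarity of the
    preferred and rejected responses' sub-sentences.

    Inputs are assumed to be precomputed CLIP similarities, one per
    sub-sentence; averaging them is an assumption of this sketch.
    """
    return mean(sims_preferred) - mean(sims_rejected)


def implicit_reward_gap(beta, logp_w, ref_logp_w, logp_l, ref_logp_l):
    """Implicit DPO reward gap between preferred (w) and rejected (l) responses."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def filtered_mean_gap(gaps, trim_fraction=0.1):
    """Model-aware signal: average reward gap after dropping outliers.

    Drops the lowest and highest `trim_fraction` of the gaps (a hypothetical
    filter; the source only says outliers are removed).
    """
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(gaps)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)


def scaled_beta(beta, signal, center=0.0, temperature=1.0):
    """Hypothetical mapping from a signal to an adjusted beta.

    The factor 2 * sigmoid((signal - center) / temperature) lies in (0, 2)
    and equals 1 when signal == center, so beta is unchanged at the center.
    """
    factor = 2.0 / (1.0 + math.exp(-(signal - center) / temperature))
    return beta * factor


def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta):
    """Per-instance DPO loss, -log(sigmoid(margin)), computed stably."""
    margin = implicit_reward_gap(beta, logp_w, ref_logp_w, logp_l, ref_logp_l)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In use, one would compute `hardness_gap` per instance and `filtered_mean_gap` per batch, pass each through `scaled_beta`, and feed the adjusted value to `dpo_loss`. How the two adjusted values are combined is left open here, since the captions above do not specify it.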