Towards Harmless Multimodal Assistants with Blind Preference Optimization

Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, Liqiang Nie

TL;DR

This work tackles safety in Multimodal Large Language Models by introducing MMSafe-PO, a high-quality, multimodal safety preference dataset derived from text-only human feedback through a modality-interpretation pipeline. It identifies modality co-defense and modality cheating as core safety phenomena and proposes Blind Preference Optimization (BPO), which augments Direct Preference Optimization (DPO) with blinded-input comparisons to strengthen visual–language alignment. Empirical results show substantial safety gains: DPO improves a base LLaVA by about 0.21 in safety rate, while BPO pushes it to roughly 0.89, with notable cross-domain robustness on MM-SafetyBench and HarmEval. The dataset and BPO approach collectively advance practical safety alignment for harmless multimodal assistants, enabling safer real-world deployment and broader evaluation across safety benchmarks.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach. We release code and data at https://lu-yang666.github.io/MMsafe-PO-Web/.
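
As a rough illustration of the preference-optimization objective that BPO builds on, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The `blind_augmented_loss` function is an assumption made for illustration only: it adds a second DPO term for a pair whose rejected response is taken to come from a blinded (image-removed) input. The function names, the `weight` parameter, and the way the blinded pair is combined are hypothetical and are not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Arguments are sequence log-probabilities under the policy being
    trained (pol_*) and under the frozen reference model (ref_*).
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def blind_augmented_loss(pair, blind_pair,
                         beta: float = 0.1, weight: float = 1.0) -> float:
    """Hypothetical combination, not the paper's exact objective.

    `pair` holds the log-probabilities for the human-ranked pair;
    `blind_pair` holds them for a pair whose rejected response is
    assumed to have been produced with the image blinded.
    """
    return dpo_loss(*pair, beta=beta) + weight * dpo_loss(*blind_pair, beta=beta)


# Example with made-up log-probabilities.
human_pair = (-1.0, -3.0, -2.0, -2.0)
blinded_pair = (-1.0, -2.5, -2.0, -2.0)
total = blind_augmented_loss(human_pair, blinded_pair)
```

When the policy assigns the chosen response a larger log-probability gain over the reference than it assigns the rejected response, the margin is positive and each term falls below log 2, the value at zero margin.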

Paper Structure

This paper contains 20 sections, 3 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: An example illustrating the necessity of safety alignment for MLLMs, where the green box represents a safe response and the red box indicates a harmful response to users.
  • Figure 2: Illustration of modality co-defense and modality cheating in MLLMs. The MLLM provides correct responses to the first two instructions but fails to answer the third instruction.
  • Figure 3: Overall pipeline for MMSafe-PO dataset construction.
  • Figure 4: (a) Illustration of the types of images used in multimodal instructions. (b) Distribution of conversation turns.
  • Figure 5: Hierarchical category analysis on the safety issues in the MMSafe-PO dataset.
  • ...and 5 more figures