Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training

Fenghua Weng; Jian Lou; Jun Feng; Minlie Huang; Wenjie Wang

TL;DR

This work tackles the problem of safety alignment for Vision-Language Models under adversarial jailbreaks, where post-hoc methods underperform against white-box attacks. It introduces Adversary-aware Direct Preference Optimization (ADPO), a two-component framework that combines an adversarially trained reference model (AR-DPO) with an adversarial-aware DPO loss (AT-DPO) to address worst-case perturbations in both image space and latent space. Empirical results on LLaVA-1.5/1.6 and HarmBench-derived data show that ADPO markedly reduces attack success rates across diverse jailbreaks, with some trade-offs in normal-task utility that can be mitigated by hyperparameter tuning. The work provides a practical robustness enhancement for VLM safety, offering a principled integration of adversarial training into alignment objectives and illuminating the value of latent-space defenses in multimodal systems.

Abstract

Safety alignment is critical in pre-training large language models (LLMs) to generate responses aligned with human values and refuse harmful queries. Unlike LLMs, the current safety alignment of VLMs is often achieved with post-hoc safety fine-tuning. However, these methods are less effective against white-box attacks. To address this, we propose $\textit{Adversary-aware DPO (ADPO)}$, a novel training framework that explicitly accounts for adversaries by integrating adversarial training into DPO, enhancing the safety alignment of VLMs under worst-case adversarial perturbations. $\textit{ADPO}$ introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversarial-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining these innovations, $\textit{ADPO}$ ensures that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that $\textit{ADPO}$ outperforms baselines in both the safety alignment and the general utility of VLMs.
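
To make the idea of a DPO loss evaluated under worst-case perturbations more concrete, the following is a minimal sketch and not the authors' implementation. It uses the standard DPO loss on sequence log-probabilities of a winner and a loser response. The function `worst_case_dpo_loss` and the idea of choosing among a finite list of candidate perturbations are illustrative assumptions; the abstract does not specify how the worst-case perturbation is found, and a real system would more likely search with a gradient-based attack in image or latent space.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair of sequence log-probs."""
    # Implicit reward margin between the winner and the loser response.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def worst_case_dpo_loss(candidates, beta=0.1):
    """Hypothetical inner maximization over a finite set of perturbations.

    Each candidate is a tuple
    (policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l)
    computed under one candidate adversarial perturbation (an assumption
    made for illustration only).
    """
    return max(dpo_loss(*c, beta=beta) for c in candidates)


# Usage: two made-up perturbations; the second one pushes the policy
# toward the loser response, so it yields the larger loss.
clean_like = (-1.0, -2.0, -1.0, -2.0)
adversarial = (-3.0, -1.0, -1.0, -2.0)
loss = worst_case_dpo_loss([clean_like, adversarial], beta=1.0)
```

Training against the largest loss over perturbations, rather than the loss on clean inputs, is what makes an objective of this kind adversarial; the log-probability values above are invented for the example.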
