Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training

Fenghua Weng, Jian Lou, Jun Feng, Minlie Huang, Wenjie Wang

TL;DR

This work tackles the problem of safety alignment for Vision-Language Models under adversarial jailbreaks, where post-hoc methods underperform against white-box attacks. It introduces Adversary-aware Direct Preference Optimization (ADPO), a two-component framework that combines an adversarially trained reference model (AR-DPO) with an adversarial-aware DPO loss (AT-DPO) to address worst-case perturbations in both image space and latent space. Empirical results on LLaVA-1.5/1.6 and HarmBench-derived data show that ADPO markedly reduces attack success rates across diverse jailbreaks, with some trade-offs in normal-task utility that can be mitigated by hyperparameter tuning. The work provides a practical robustness enhancement for VLM safety, offering a principled integration of adversarial training into alignment objectives and illuminating the value of latent-space defenses in multimodal systems.

Abstract

Safety alignment is critical when pre-training large language models (LLMs) so that they generate responses aligned with human values and refuse harmful queries. Unlike in LLMs, safety alignment in current VLMs is often achieved through post-hoc safety fine-tuning. However, these methods are less effective against white-box attacks. To address this, we propose $\textit{Adversary-aware DPO (ADPO)}$, a novel training framework that explicitly considers adversarial perturbations. $\textit{ADPO}$ integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations. $\textit{ADPO}$ introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversary-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining these components, $\textit{ADPO}$ keeps VLMs robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that $\textit{ADPO}$ outperforms baselines in the safety alignment and general utility of VLMs.
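To make the idea of a DPO loss evaluated under worst-case perturbations more concrete, the following is a minimal sketch and not the authors' implementation. `dpo_loss` is the standard DPO objective for one winner-loser pair. `worst_case_dpo_loss` is a hypothetical helper that assumes the perturbation search is replaced by a finite list of candidates, each supplying the sequence log-probabilities obtained under one perturbation; in practice the worst case would be found by an optimization procedure in image space or latent space rather than by enumeration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair.

    Each argument is a sequence log-probability: the policy's and the
    reference model's log-probability of the preferred (w) and the
    rejected (l) response.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


def worst_case_dpo_loss(candidates, beta=0.1):
    """Hypothetical helper: largest DPO loss over candidate perturbations.

    `candidates` is a list of 4-tuples
    (policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l),
    one tuple per perturbation. This finite maximum is an assumption
    standing in for an inner adversarial optimization.
    """
    return max(dpo_loss(*c, beta=beta) for c in candidates)


if __name__ == "__main__":
    clean = (-1.0, -5.0, -2.0, -2.0)      # policy prefers the safe response
    perturbed = (-5.0, -1.0, -2.0, -2.0)  # perturbation flips the preference
    print(dpo_loss(*clean, beta=1.0))
    print(worst_case_dpo_loss([clean, perturbed], beta=1.0))
```

Training against the worst-case value rather than the clean value is what pushes the model to keep preferring the safe response even when a perturbation would otherwise flip the preference.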

Paper Structure

This paper contains 24 sections, 12 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Safe response rate under white-box and black-box attacks on LLaVA-1.5. Post-hoc safety fine-tuning (SFT and DPO) is less effective against white-box attacks.
  • Figure 2: Pipeline of ADPO: achieving adversary-aware safety alignment with an adversarially trained reference model and an adversary-aware DPO loss. The worst-case perturbation is generated in image space or in the latent space of the image-text embedding.
  • Figure 3: Safety-utility trade-off, where the jailbreak dimensions indicate the ASR reduction (larger is better). A larger area for a method indicates more effective safety alignment and better utility preservation.
  • Figure 4: Visualization of the representation space of LLaVA-1.5 trained with ADPO, its ablations, and FT. (1) HarmBench queries (green) are closer to the harmful anchor cluster (yellow), demonstrating the model's success in recognizing their harmfulness. (2) LLaVA-1.5 trained with ADPO and its ablations successfully moves the orange cluster closer to the harmful (yellow) and HarmBench (green) clusters (black arrow) while pushing it further from the harmless cluster (blue, red arrow), indicating that the safety-aligned model can better recognize the harmfulness of HarmBench queries even in the presence of jailbreak attacks.
  • Figure 5: Ablation study on adversarial training $\alpha$.
  • ...and 1 more figure