GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models

Xingyu Zhu, Beier Zhu, Junfeng Fang, Shuo Wang, Yin Zhang, Xiang Wang, Xiangnan He

TL;DR

GuardAlign is a training-free defense framework that integrates two strategies. The first, OT-enhanced safety detection, leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. The second, cross-modal attentive calibration, strengthens the influence of safety prefixes by adaptively reallocating attention across layers.

Abstract

Large vision-language models (LVLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL, while preserving utility, achieving an improvement on VQAv2 from 78.51% to 79.21%.
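The abstract does not spell out the calibration rule, so the following is only a minimal sketch of one simple way to shift attention mass toward safety-prefix tokens. It assumes a row-normalized attention matrix and a fixed multiplicative factor; the function name `boost_prefix_attention` and the parameter `boost` are hypothetical and not taken from the paper, whose calibration is described as adaptive across layers.

```python
import numpy as np

def boost_prefix_attention(attn, prefix_idx, boost=2.0):
    """Reweight attention toward prefix tokens and renormalize.

    attn:       array of shape (..., seq_len) whose rows sum to 1
    prefix_idx: list of column indices holding the safety prefix (assumed known)
    boost:      hypothetical fixed factor; the paper's rule is adaptive per layer
    """
    scaled = np.array(attn, dtype=float)   # copy so the input stays untouched
    scaled[..., prefix_idx] *= boost       # amplify the prefix columns
    # Renormalize so each row is a valid attention distribution again.
    return scaled / scaled.sum(axis=-1, keepdims=True)

attn = np.array([[0.1, 0.1, 0.4, 0.4]])   # two prefix tokens, two other tokens
calibrated = boost_prefix_attention(attn, prefix_idx=[0, 1], boost=2.0)
prefix_mass_before = attn[..., [0, 1]].sum()
prefix_mass_after = calibrated[..., [0, 1]].sum()
```

In this toy row the prefix share of attention grows after calibration while the row still sums to one; an adaptive variant would choose the factor per layer, for instance from how far the prefix share has decayed.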

Paper Structure (26 sections, 1 theorem, 33 equations, 5 figures, 11 tables)

This paper contains 26 sections, 1 theorem, 33 equations, 5 figures, 11 tables.

Key Result

Theorem 1

For classifying image patches as safe ($y=0$) or unsafe ($y=1$), the classification error using Optimal Transport (OT) distance is less than or equal to that using cosine similarity: $P_{\text{error}}^{\text{OT}} \leq P_{\text{error}}^{\text{cos}}$, with equality when OT weights are uniform.
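For orientation, the two quantities compared in the theorem can be written in the standard discrete form below. This is a sketch under assumptions: the symbols $x_i$ (patch features), $z_j$ (unsafe-text features), the marginal weights $a, b$, and the plan $T$ are notation introduced here, not taken from the paper, and the cost is assumed to be cosine distance.

```latex
\begin{align}
C_{ij} &= 1 - \cos(x_i, z_j), \\
d_{\mathrm{OT}} &= \min_{T \in \Pi(a,b)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, C_{ij},
\qquad
\Pi(a,b) = \{\, T \ge 0 : T\mathbf{1}_m = a,\; T^{\top}\mathbf{1}_n = b \,\}, \\
T_{ij} = \tfrac{1}{nm}
\;&\Longrightarrow\;
\sum_{i,j} T_{ij}\, C_{ij} = 1 - \frac{1}{nm} \sum_{i,j} \cos(x_i, z_j).
\end{align}
```

The last line evaluates the transport cost at a uniform plan rather than at the minimizer: the cost then reduces to an average cosine distance, which is one plausible reading of why the theorem's equality case arises when the OT weights are uniform.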

Figures (5)

  • Figure 1: Comparison between existing strategies (ETA) and ours. (a) Similarity scores overlap between safe and unsafe images, while OT-based transport costs yield clear separation for reliable detection. (b) Prefix attention decays in the existing strategies (orange) but remains stable with ours (blue). (c) Example from SPA-VL showing that existing methods generate harmful content, whereas ours maintains safety. (d) Our method achieves consistent safety gains across diverse benchmarks.
  • Figure 2: Framework of GuardAlign. OT-Enhanced Safety Detection: image patches and predefined unsafe prompt categories are jointly encoded, and optimal transport is used to identify patches that align with harmful semantics. The most suspicious patches are masked to produce a sanitized image. Cross-Modal Attention Calibration: a lightweight safety prefix is added to the query, and the multimodal model attends over the sanitized visual tokens. This design guides the model toward safe evidence and prevents unsafe generations.
  • Figure 3: Comparison of safe and unsafe patch feature distributions using different distance metrics. (a): OT distance effectively separates the two distributions. (b): Cosine distance provides a less distinct separation between safe and unsafe patches.
  • Figure 4: Analysis of factors affecting safety and utility. (a) Varying $\tau$ shows a trade-off: smaller values lower USR with higher helpfulness, while larger ones improve robustness at some cost. (b) More patches reduce USR while keeping helpfulness stable, as finer partitioning exposes hidden malicious semantics. (c) Across CLIP backbones from RN50 to SigLIP, our method lowers harm rates while preserving VQA accuracy, showing robustness to encoder variations.
  • Figure 5: Examples of mask-guided alignment. Unsafe regions in the original image are automatically detected and masked.
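The detection-and-masking stage described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: it uses entropy-regularized OT (Sinkhorn iterations) with uniform marginals and a cosine-distance cost, scores each patch by its average transport cost, and treats the lowest-cost patches as the most suspicious. The function names and the parameters `eps`, `n_iters`, and `k` are hypothetical.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between histograms a (n,) and b (m,)."""
    K = np.exp(-cost / eps)              # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # match column marginals
        u = a / (K @ v)                  # match row marginals
    return u[:, None] * K * v[None, :]

def patch_transport_costs(patch_feats, unsafe_feats, eps=0.1):
    """Per-patch and total transport cost from patches to unsafe-text features."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = unsafe_feats / np.linalg.norm(unsafe_feats, axis=1, keepdims=True)
    cost = 1.0 - p @ t.T                 # cosine distance, shape (n, m)
    n, m = cost.shape
    a = np.full(n, 1.0 / n)              # uniform patch weights (assumption)
    b = np.full(m, 1.0 / m)              # uniform category weights (assumption)
    plan = sinkhorn_plan(cost, a, b, eps)
    # Average cost of the mass each patch sends out (hypothetical patch score).
    per_patch = (plan * cost).sum(axis=1) / plan.sum(axis=1)
    total = float((plan * cost).sum())
    return per_patch, total, plan

def mask_suspicious(per_patch_cost, k):
    """Indices of the k patches with the lowest cost to unsafe semantics."""
    return np.argsort(per_patch_cost)[:k]

# Toy features: patches 0 and 1 coincide with the two unsafe directions.
patches = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [0, 0, 1.0]])
unsafe = np.array([[1.0, 0, 0], [0, 1.0, 0]])
per_patch, total, plan = patch_transport_costs(patches, unsafe)
to_mask = mask_suspicious(per_patch, k=2)
```

In practice the features would come from an image-text encoder such as CLIP, and the selected patch indices would be blanked out in the image before it is passed to the multimodal model.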

Theorems & Definitions (2)

  • Theorem 1
  • proof