Understanding and Rectifying Safety Perception Distortion in VLMs

Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin

TL;DR

This paper investigates why vision-language models exhibit weaker safety alignment than their text-only backbones, identifying a modality-induced activation shift that biases activations toward a safer region and reduces detection of unsafe inputs. It introduces Activation Shift Disentanglement and Calibration (ShiftDC), a training-free, inference-time method that disentangles and removes the safety-relevant component of the shift while retaining safety-irrelevant visual information. By computing the safety direction in activation space and applying a calibrated shift after projecting out the safety-relevant component, ShiftDC restores the backbone's safety alignment without sacrificing vision-language utility. Empirical results across multiple VLMs and safety/utility benchmarks show that ShiftDC significantly improves safety alignment, reduces jailbreak success, and maintains strong visual reasoning performance, with modest inference-time overhead.

Abstract

Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety. By isolating and removing the safety-relevant component, ShiftDC restores the inherent safety alignment of the LLM backbone while preserving the vision-language capabilities of VLMs. Empirical results demonstrate that ShiftDC significantly enhances alignment performance on safety benchmarks without impairing model utility.
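The decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the safety direction is estimated as a difference of mean activations between unsafe and safe text-only inputs, and that the modality-induced shift is the difference between the activation of a vision-language input and that of a text-only counterpart. The function names `safety_direction` and `calibrate`, the toy dimensions, and the synthetic data are all hypothetical.

```python
import numpy as np

def safety_direction(unsafe_acts, safe_acts):
    """Assumed estimator: difference of class means, pointing from safe to unsafe."""
    return unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)

def calibrate(act_vl, act_text, direction):
    """Remove the safety-relevant part of the modality-induced shift.

    act_vl    : activation for the vision-language input
    act_text  : activation for its text-only counterpart (assumed available)
    direction : safety direction in the same activation space
    """
    shift = act_vl - act_text                      # modality-induced shift
    unit = direction / np.linalg.norm(direction)   # assumes a nonzero direction
    safety_part = np.dot(shift, unit) * unit       # projection onto the safety direction
    return act_vl - safety_part                    # keep only the safety-irrelevant part

# Toy data: 8-dimensional "activations" with unsafe samples offset along axis 0.
rng = np.random.default_rng(0)
dim = 8
safe = rng.normal(size=(32, dim))
unsafe = rng.normal(size=(32, dim)) + 2.0 * np.eye(dim)[0]
s = safety_direction(unsafe, safe)

act_text = unsafe[0]
visual_info = rng.normal(size=dim)
# Simulated distortion: the image adds visual information and also pushes
# the activation against the safety direction, i.e. toward the "safe" side.
act_vl = act_text + visual_info - 1.5 * s / np.linalg.norm(s)

calibrated = calibrate(act_vl, act_text, s)
```

In this toy setting the calibrated activation has the same coordinate along the safety direction as the text-only activation, while the component of the shift orthogonal to that direction, standing in for safety-irrelevant visual information, is left untouched.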

Paper Structure

This paper contains 28 sections, 8 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Vision-language inputs cause a modality-induced activation shift, steering VLM activations toward a “safer” direction compared to text-only inputs. This makes the VLM perceive inputs as less risky than they actually are, weakening its safety alignment.
  • Figure 2: Examples of constructed datasets.
  • Figure 3: Safety classification accuracy by probing per layer.
  • Figure 4: Confusion matrices of safety-probing classifiers trained on text-only $\mathcal{D}_\text{tt}$ and tested on vision-language $\mathcal{D}_\text{vl}$.
  • Figure 5: t-SNE visualization of the model's last token activations on $\mathcal{D}_\text{tt}^\text{safe}$, $\mathcal{D}_\text{tt}^\text{unsafe}$, $\mathcal{D}_\text{vl}^\text{safe}$, and $\mathcal{D}_\text{vl}^\text{unsafe}$. The red line indicates the boundary between text-only safe samples and unsafe samples.
  • ...and 5 more figures