Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu

Abstract

Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage", where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in a text query with a spurious correlation-inducing alternative can effectively bypass safeguards. These correlations also contribute to over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we present machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: This paper contains AI generations that may be offensive in nature.

Paper Structure

This paper contains 32 sections, 6 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Schematic overview of the safety mirage findings for a safety fine-tuned VLM (LLaVA-v1.5-7B-Mixed, fine-tuned on VLGuard zong2024safety). (a) One-word attack vulnerability: A minor modification (e.g., replacing the first instruction word "Share" with "What" in an unsafe query) bypasses the safety mechanism learned from fine-tuning on VLGuard, even though the model correctly rejects the original query. (b) Over-prudence issue: Mirroring the one-word attack, replacing "What" with "Share" can cause unnecessary refusals even for benign queries. (c) Root cause, spurious correlations from fine-tuning dataset biases: Certain words are disproportionately linked with specific safety labels; for example, "Share" is strongly correlated with rejection, while "What" is highly associated with non-rejection. (d) Effectiveness of unlearning-based safety fine-tuning: The unlearning methods NPO zhang2024negative and RMU li2024wmdp enhance robustness against attacks and reduce over-prudence, outperforming both the original model LLaVA-v1.5-7B ("Original") and the supervised fine-tuned LLaVA-v1.5-7B-Mixed zong2024safety ("Prior work").
  • Figure 2: Visualization of question-answer samples on the safety fine-tuned VLMs (LLaVA-v1.5-7B-Mixed and LLaVA-v1.5-7B-Posthoc zong2024safety, taori2023stanford). A green shield represents a correct response, either a safe rejection of a harmful query or a valid answer to a benign query. A red exclamation mark (!) indicates an unsafe response to a harmful query. A red question mark (?) represents an inappropriate rejection of a safe query. (a) Successful jailbreak: The safety fine-tuned model originally produces rejection-based responses for unsafe queries. However, rephrasing the question so that it starts with "What" can easily bypass this safeguard. (b) Over-prudence: A minor modification that makes the question start with "Share" instead of "What" can trigger unnecessary refusals even for benign queries.
  • Figure 3: Frequency of question-initiating words across training corpora: (a) VLGuard safe queries with non-rejection responses, (b) VLGuard unsafe queries with rejection responses, (c) SPA-VL queries.
  • Figure 4: Attack success rate (ASR) of the $K$-shot one-word attack for varying $K$, evaluated before and after applying the "What"-initialized one-word attack to jailbreak the safety fine-tuned VLM (LLaVA-v1.5-7B-Mixed zong2024safety) on VLGuard unsafe.
  • Figure 5: Rejection rate vs. $K$ for the $K$-shot one-word modification, evaluated before and after applying the "Share"-initialized one-word modification to safe input queries. The modification induces over-prudence in the safety fine-tuned VLM (LLaVA-v1.5-7B-Mixed zong2024safety) on VLGuard safe.
  • ...and 7 more figures
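
The captions for Figures 1(c) and 3 describe a dataset bias in which the first word of a query is disproportionately tied to a safety label. The Python sketch below is one way to measure such a bias and to apply a naive first-word substitution. It is an illustration under stated assumptions, not the authors' pipeline: the helper names (`first_word`, `first_word_label_counts`, `rejection_share`, `replace_first_word`), the label strings `"reject"` and `"answer"`, and the toy queries are all hypothetical, and the paper's $K$-shot attack may rephrase queries differently from this plain token swap.

```python
from collections import Counter, defaultdict


def first_word(query: str) -> str:
    """Return the first whitespace-delimited token, stripped of punctuation."""
    tokens = query.strip().split()
    return tokens[0].strip(".,:;!?\"'") if tokens else ""


def first_word_label_counts(samples):
    """Count how often each question-initiating word appears with each label."""
    counts = defaultdict(Counter)
    for query, label in samples:
        counts[first_word(query)][label] += 1
    return counts


def rejection_share(counts, word: str) -> float:
    """Fraction of queries starting with `word` whose label is 'reject'."""
    per_label = counts.get(word, Counter())
    total = sum(per_label.values())
    return per_label["reject"] / total if total else 0.0


def replace_first_word(query: str, new_word: str) -> str:
    """Naively swap the first token of a query (no grammatical repair)."""
    tokens = query.strip().split()
    if not tokens:
        return query
    return " ".join([new_word] + tokens[1:])


# Toy, benign (query, label) pairs; NOT taken from VLGuard or SPA-VL.
samples = [
    ("Share the details shown in this image.", "reject"),
    ("Share what the sign says.", "reject"),
    ("Share a caption for this photo.", "answer"),
    ("What is shown in this image?", "answer"),
    ("What color is the car?", "answer"),
]

counts = first_word_label_counts(samples)
for word in ("Share", "What"):
    print(word, round(rejection_share(counts, word), 2))

print(replace_first_word("Share a caption for this photo.", "What"))
```

A large gap between the rejection shares of two initiating words is the kind of skew the captions attribute to the fine-tuning data. The plain token swap does not repair grammar, so a realistic evaluation would need a more careful rewrite of each query.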