Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang, Shu Yang, Lijie Hu, Di Wang

TL;DR

This work extends the self-fulfilling mechanism from text to the visual modality, offering a label-free approach to VLM alignment. It demonstrates that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities.

Abstract

Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. Existing methods address this with explicit safety labels or contrastive data. Yet threat-related concepts are concrete and visually depictable, whereas safety concepts such as helpfulness are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to the visual modality, offering a label-free approach to VLM alignment.
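To make the "neutral VQA, no safety labels" idea concrete, the following is a minimal sketch of what one training record might look like. The question templates, the `VQARecord` type, and the `build_vqa_records` helper are hypothetical names introduced here for illustration; the abstract does not specify the actual prompts or data schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical neutral question templates (assumption: the actual
# prompts used by the authors are not given in this summary).
NEUTRAL_QUESTIONS = [
    "What objects are visible in this image?",
    "Describe the colors and lighting in this image.",
    "How many distinct elements can you count in this image?",
]


@dataclass
class VQARecord:
    """One visual instruction-tuning example. Note: no safety label field."""
    image_path: str
    question: str
    answer: str


def build_vqa_records(image_path, answers):
    """Pair a single image with each neutral question and its answer."""
    if len(answers) != len(NEUTRAL_QUESTIONS):
        raise ValueError("expected one answer per neutral question")
    return [
        VQARecord(image_path, question, answer)
        for question, answer in zip(NEUTRAL_QUESTIONS, answers)
    ]


records = build_vqa_records(
    "threat_image_0001.png",  # placeholder path, not from the source
    ["A server rack and cables.", "Dim red lighting.", "Four."],
)
print(asdict(records[0]))
```

The point of the sketch is only that the schema carries an image, a question, and an answer; nothing in the record marks content as safe or unsafe.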

Paper Structure (50 sections, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Overview of the VSFA framework. We collect AI safety abstracts from arXiv, transform them into image prompts via GPT-4o-mini, and generate threat-related images using the Doubao API. Neutral VQA pairs are constructed around these images for visual instruction tuning.
  • Figure 2: Ablation study on fine-tuning modality. We compare three variants: Image (threat-related images with neutral VQA), Text (text-only safety data), and Mixed (combination of both). (a) Attack Success Rate across four models. (b) Constructive Score across four models.
  • Figure 3: Bidirectional steering effects of top SAE latents. Bars below zero show ASR reduction from adding the latent to the original model. Bars above zero show ASR increase from removing it from the VSFA model. Latent #12 shows the strongest effect in both directions (-18%/+14%), confirming it as the primary safety-oriented persona latent.
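
The quantities in Figures 2 and 3 can be illustrated with a small sketch. It assumes that ASR is the fraction of attack prompts judged to yield a harmful response, that "adding" a latent means shifting a hidden activation along a unit direction, and that "removing" it means projecting that direction out. These are common conventions, not details confirmed by the captions; the function names and the toy vectors are hypothetical.

```python
import math


def attack_success_rate(harmful_flags):
    """Fraction of attack prompts whose response was judged harmful."""
    return sum(harmful_flags) / len(harmful_flags)


def unit(vector):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def add_latent(activation, direction, alpha):
    """Steer toward the latent: h' = h + alpha * d, with d of unit norm."""
    d = unit(direction)
    return [h + alpha * di for h, di in zip(activation, d)]


def remove_latent(activation, direction):
    """Ablate the latent: h' = h - (h . d) d, with d of unit norm."""
    d = unit(direction)
    coeff = sum(h * di for h, di in zip(activation, d))
    return [h - coeff * di for h, di in zip(activation, d)]


h = [1.0, 2.0]          # toy activation (assumption)
direction = [0.0, 2.0]  # toy latent direction (assumption)
print(add_latent(h, direction, alpha=3.0))
print(remove_latent(h, direction))
print(attack_success_rate([True, False, False, True]))
```

Under these assumptions, a bidirectional test compares ASR before and after `add_latent` on the original model, and before and after `remove_latent` on the fine-tuned model.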