SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge

Adeel Yousaf, Joseph Fioresi, James Beetham, Amrit Singh Bedi, Mubarak Shah

TL;DR

SafeR-CLIP addresses the safety-performance trade-off by redirecting each unsafe concept to its semantically closest safe alternative, which preserves the pretrained geometry. It introduces relative cross-modal redirection and proximity-based alignment, plus a progressive training schedule. NSFWCaps provides a 1,000-pair benchmark for evaluation under distributional shift. Across retrieval, zero-shot classification, and generation, SafeR-CLIP recovers up to 8% in zero-shot accuracy over prior safety fine-tuning methods while maintaining robust NSFW mitigation and reducing representational drift.

Abstract

Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model's learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SaFeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SaFeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift. Our work shows that respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.
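
The proximity-aware idea in the abstract (redirect each unsafe concept to its semantically closest safe alternative) can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `cosine_similarity` and `top1_proximal_index` and the toy 3-dimensional embeddings are hypothetical, and it assumes that "closest" means highest cosine similarity between non-zero embedding vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top1_proximal_index(unsafe_embedding, safe_embeddings):
    """Index of the safe embedding most similar to the unsafe embedding."""
    sims = [cosine_similarity(unsafe_embedding, s) for s in safe_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy embeddings (illustrative values only, not taken from the paper).
unsafe = [0.9, 0.1, 0.0]
safe_pool = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

# The second safe embedding points in nearly the same direction as `unsafe`,
# so it is selected as the redirection target.
target_index = top1_proximal_index(unsafe, safe_pool)
```

Choosing the nearest safe target, rather than one fixed target for all unsafe inputs, keeps the required displacement in embedding space small, which is the "minimal intervention" principle the abstract describes.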

Paper Structure

This paper contains 18 sections, 16 equations, 4 figures, 19 tables.

Figures (4)

  • Figure 2: Overview of the SafeR-CLIP training framework (shown for the image encoder; a similar strategy is applied to the text encoder). Unsafe inputs are redirected toward their most semantically compatible safe counterpart—referred to as the Top-1 Proximal Embedding—based on cosine similarity in the pretrained space. The uni-modal directional loss aligns unsafe embeddings to their safe proximal targets within the same modality. The relative cross-modal loss encourages unsafe embeddings to move closer to their aligned safe target and away from their original unsafe representation. A preservation loss is applied to safe inputs to maintain generalization. Together, these components achieve robust safety alignment with minimal disruption to the pretrained representation space.
  • Figure 3: Overview of the NSFWCaps dataset. Unsafe captions and images are minimally modified versions of their safe counterparts, preserving the original context while introducing NSFW elements. This ensures tight semantic alignment and enables controlled evaluation of cross-modal safety.
  • Figure 4: Examples from the ViSU dataset reveal that many safe–unsafe pairs are weakly aligned, with mismatched actions, contexts, or visual content. This inconsistency introduces ambiguity during training and undermines the validity of evaluation results.
  • Figure 5: Relative L2 distance between fine-tuned and original CLIP weights for both text (left) and vision (right) encoders across training epochs. Our method shows lower weight deviation than Safe-CLIP, suggesting better preservation of pretrained representations.
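
The weight-deviation measure in the Figure 5 caption can be sketched as follows. The caption does not give an exact formula, so this assumes one common definition: the L2 norm of the difference between fine-tuned and original weights, divided by the L2 norm of the original weights, taken over all parameters of an encoder. The function name `relative_l2_distance` and the toy parameter dictionaries are hypothetical.

```python
import math

def relative_l2_distance(original, finetuned):
    """Return ||finetuned - original||_2 / ||original||_2 over all parameters.

    Both arguments map parameter names to flat lists of floats and are
    assumed to share the same keys and shapes.
    """
    diff_sq = 0.0
    orig_sq = 0.0
    for name, w0 in original.items():
        w1 = finetuned[name]
        diff_sq += sum((b - a) ** 2 for a, b in zip(w0, w1))
        orig_sq += sum(a * a for a in w0)
    return math.sqrt(diff_sq / orig_sq)

# Toy weights (illustrative values only, not taken from the paper).
original = {"layer.weight": [3.0, 4.0]}
finetuned = {"layer.weight": [3.0, 4.5]}

drift = relative_l2_distance(original, finetuned)  # 0.5 / 5.0
```

Under this definition a value of 0 means the fine-tuned encoder is identical to the original, and smaller values across epochs correspond to the lower deviation the caption reports for the proposed method relative to Safe-CLIP.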