Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs

Wei Zhao, Zhe Li, Yige Li, Jun Sun

TL;DR

SafeCLIP is a zero-shot defense for LVLMs that repurposes CLIP's visual CLS token to detect toxic images without architectural changes. It enables dynamic safety corrections during both inference and fine-tuning, achieving a Defense Success Rate of 66.9% with a 3.2% false positive rate (FPR) and only 7.2% overhead on neutral inputs, compared with 52.9%, 10.7%, and 210% respectively for prior state-of-the-art methods. The approach matches the CLS token against a safety concept bank in CLIP's text space using a simple, efficient decision rule. This offers a practical, scalable path to safer multimodal models, with potential future extensions to adversarial and more diverse attack vectors.

Abstract

Large Vision-Language Models (LVLMs) have made significant strides in multimodal comprehension, thanks to extensive pre-training and fine-tuning on large-scale visual datasets. However, despite their robust textual safety mechanisms, they remain vulnerable to harmful visual inputs. Existing safeguards, which typically rely on pre-filtering or fine-tuning, incur high costs and diminish overall utility. To address this critical vulnerability, we introduce SafeCLIP, a lightweight method that leverages LVLMs' inherent multimodal alignment for zero-shot toxic image detection. By projecting CLIP's discarded CLS token into its text space and matching it with toxic descriptors, SafeCLIP detects harmful content without any architectural changes, adding minimal latency and enabling dynamic safety corrections during inference and fine-tuning. Experiments show that SafeCLIP achieves a 66.9% defense success rate with only a 3.2% false positive rate and 7.2% overhead. In contrast, state-of-the-art methods achieve 52.9% success but have a 10.7% false positive rate and 210% overhead. Our work demonstrates that leveraging inherent multimodal alignment can yield efficient, low-cost LVLM safety. Code is available at anonymous.4open.science/r/safeclip-2C01.
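The detection idea in the abstract (project the CLS token into CLIP's text space, then match it against toxic descriptors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `detect_toxic`, the `visual_proj` matrix, the concept names, the temperature, and the argmax decision rule are all assumptions made for illustration, and small hand-written vectors stand in for real CLIP embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def detect_toxic(cls_token, visual_proj, concept_embeds, concept_names,
                 neutral_index=0, temperature=0.01):
    """Hypothetical zero-shot check: flag the image if its best-matching
    concept is anything other than the neutral one (assumed decision rule)."""
    # Project the visual CLS token into the shared text embedding space.
    image_embed = l2_normalize(cls_token @ visual_proj)      # shape (d_text,)
    concepts = l2_normalize(concept_embeds)                  # shape (k, d_text)
    sims = concepts @ image_embed                            # cosine similarities, shape (k,)
    # Temperature-scaled softmax over concepts (numerically stabilised).
    logits = sims / temperature
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    best = int(np.argmax(probs))
    return best != neutral_index, concept_names[best], probs

# Toy data: three text-space concepts and a 4-d "CLS token" (all made up).
names = ["neutral", "violence", "gore"]
concept_bank = np.eye(3)                                     # one unit vector per concept
proj = np.vstack([np.eye(3), np.zeros((1, 3))])              # 4-d visual -> 3-d text space

flagged, label, _ = detect_toxic(np.array([0.1, 0.9, 0.2, 0.5]), proj, concept_bank, names)
print(flagged, label)
```

In a real setting the concept bank would come from CLIP's text encoder applied to descriptor prompts and the projection would be CLIP's own visual projection; since the check reuses a token the LVLM's vision encoder already computes, the extra cost is one small matrix product per image.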

Paper Structure

This paper contains 29 sections, 9 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Multimodal processing pipeline in visual language models. Visual input $X_v$ is encoded into CLS token and features $Z_v$, which are projected to $H_v$. Text input $X_q$ is tokenized into $H_q$, concatenated with $H_v$, and processed by language model $F_\theta$ to generate response $Y_a$.
  • Figure 2: Efficiency Comparison: Average Performance on 100 Neutral and Toxic Image Requests
  • Figure 3: OpenAI Safety Judge Template
  • Figure 4: Template-2 for ablation study
  • Figure 5: Template-3 for ablation study
  • ...and 3 more figures
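
The data flow in the Figure 1 caption can be traced with a shape-only sketch. Everything here is an assumption for illustration: the dimensions, the helper names `split_encoder_output` and `build_lm_input`, the linear projector, and the choice to place visual tokens before text tokens. Random arrays stand in for the real vision encoder and tokenizer outputs, and the language model $F_\theta$ is omitted.

```python
import numpy as np

def split_encoder_output(tokens):
    """Split vision-encoder output into the CLS token and patch features Z_v.
    Assumes the CLS token is the first row."""
    return tokens[0], tokens[1:]

def build_lm_input(z_v, w_proj, h_q):
    """Project patch features Z_v to H_v and concatenate with text embeddings H_q.
    The visual-then-text ordering is an assumption."""
    h_v = z_v @ w_proj                          # (n_patches, d_lm)
    return np.concatenate([h_v, h_q], axis=0)   # (n_patches + n_text, d_lm)

rng = np.random.default_rng(0)
n_patches, n_text, d_vis, d_lm = 16, 5, 8, 12   # illustrative sizes only

encoder_out = rng.standard_normal((n_patches + 1, d_vis))  # stand-in for encoding X_v
cls_token, z_v = split_encoder_output(encoder_out)
w_proj = rng.standard_normal((d_vis, d_lm))                # stand-in projector
h_q = rng.standard_normal((n_text, d_lm))                  # stand-in for tokenized X_q

lm_input = build_lm_input(z_v, w_proj, h_q)
print(cls_token.shape, z_v.shape, lm_input.shape)
```

Note that `cls_token` never enters `lm_input`: only the patch features are projected and passed on, which is why the abstract describes the CLS token as discarded and available for reuse.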