Safe Vision-Language Models via Unsafe Weights Manipulation

Moreno D'Incà, Elia Peruzzo, Xingqian Xu, Humphrey Shi, Nicu Sebe, Massimiliano Mancini

TL;DR

This work studies safety in vision-language models (VLMs) and shows that training-based safety approaches can degrade both the model's safety on safe inputs and its knowledge. It introduces SafeGround, a set of metrics that evaluates safety at multiple levels of granularity, and uses it to show that fine-tuning for safety (Safe-CLIP) can unintentionally erode safe representations. To address this, the authors propose Unsafe Weights Manipulation (UWM), a training-free method that analyzes how information flow differs between safe and unsafe content, identifies the weights most involved in processing unsafe content, and manipulates them via negation. Across several VLM architectures, including LLaVA, UWM delivers the best balance between safety improvements and knowledge preservation, outperforming training-based baselines in many scenarios. Overall, SafeGround serves as a diagnostic tool for safety evaluation, and UWM offers a training-free route to safer VLMs.

Abstract

Vision-language models (VLMs) often inherit the biases and unsafe associations present within their large-scale training datasets. While recent approaches mitigate unsafe behaviors, their evaluation focuses on how safe the model is on unsafe inputs, ignoring potential shortcomings on safe ones. In this paper, we first revise safety evaluation by introducing SafeGround, a new set of metrics that evaluate safety at different levels of granularity. With these metrics, we uncover a surprising issue of training-based methods: they make the model less safe on safe inputs. From this finding, we take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter. Their values are then manipulated via negation. Experiments show that UWM achieves the best tradeoff between safety and knowledge preservation, consistently improving VLMs on unsafe queries while outperforming even training-based state-of-the-art methods on safe ones.
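
The select-and-negate idea described in the abstract can be illustrated with a minimal sketch for a single linear layer. This is not the authors' implementation: the scoring rule (weight magnitude times the gap between unsafe and safe activation norms), the function names `column_norms` and `negate_unsafe_weights`, and the default value of `tau` are all assumptions made for illustration. Only the overall recipe (score weights on safe versus unsafe calibration data, pick a small top fraction, negate them) comes from the text above.

```python
import math


def column_norms(acts):
    """L2 norm of each input feature over a calibration batch (rows = samples)."""
    n_features = len(acts[0])
    return [math.sqrt(sum(row[j] ** 2 for row in acts)) for j in range(n_features)]


def negate_unsafe_weights(weight, safe_acts, unsafe_acts, tau=0.02):
    """Negate the top `tau` fraction of weights by a hypothetical unsafe score.

    weight:      list of rows, shape (out_features, in_features)
    safe_acts:   layer inputs recorded on safe calibration samples
    unsafe_acts: layer inputs recorded on unsafe calibration samples
    """
    norm_s = column_norms(safe_acts)
    norm_u = column_norms(unsafe_acts)

    # Assumed score: |w_ij| * (unsafe norm - safe norm) of input feature j.
    # High values mean the weight matters more for unsafe than for safe inputs.
    scored = [
        (abs(w) * (norm_u[j] - norm_s[j]), i, j)
        for i, row in enumerate(weight)
        for j, w in enumerate(row)
    ]

    # Keep only a small fraction of the weights (at least one).
    k = max(1, round(tau * len(scored)))
    selected = {(i, j) for _, i, j in sorted(scored, reverse=True)[:k]}

    # Return a new matrix; the original weights are left untouched.
    new_weight = [
        [-w if (i, j) in selected else w for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]
    return new_weight, selected


# Toy example: feature 0 fires only on safe data, feature 2 only on unsafe data.
W = [[1.0, -2.0, 3.0], [4.0, 5.0, -6.0]]
safe = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
unsafe = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
new_W, picked = negate_unsafe_weights(W, safe, unsafe, tau=0.2)
print(picked, new_W)
```

In the toy example, the largest weight attached to the unsafe-only feature is the one selected, and its sign is flipped. In a real VLM the same procedure would presumably be applied per layer and per encoder, with activations collected through forward hooks. Those details are outside this sketch.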

Paper Structure

This paper contains 22 sections, 13 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: VLMs (e.g., CLIP radford2021learning) exhibit unsafe behaviors. Training-based safety alignment methods (Safe-CLIP poppi2024removing) improve safety for unsafe queries but compromise safety on safe inputs and degrade the model’s knowledge. UWM improves safety while better preserving the model’s capabilities.
  • Figure 2: CLIP vs Safe-CLIP. The preference metrics $\mathtt{P_s^t}$ and $\mathtt{P_s^v}$ expose Safe-CLIP's degraded safety on safe queries.
  • Figure 3: Unsafe Weights Manipulation (UWM) discovers unsafe weights of a given VLM and manipulates them to improve safety. Using safe and unsafe data within a single modality (e.g., the image modality shown in the center), UWM analyzes each encoder's activations to estimate layer-wise safe and unsafe scores, which are then aggregated. Lastly, a small percentage of the top-scoring weights is targeted and manipulated towards safety. Applied independently to each encoder, UWM prevents cross-modal interference during scoring.
  • Figure 4: Qualitative results for $\mathcal{T}_u\text{-}\mathcal{V}_{u+s}$ and $\mathcal{V}_u\text{-}\mathcal{T}_{u+s}$ tasks.
  • Figure 5: We ablate the sparsity $\tau$ and show the trend of four key metrics: $\mathcal{V}_s\text{-}\mathcal{T}_s$, $\mathtt{Txt}_\mathtt{s}$, $\mathtt{Img}_\mathtt{s}$, and $\mathtt{GS}$. The original CLIP performance is shown as a dashed line (- - -) and the final chosen configuration is marked with $\star$.
  • ...and 4 more figures