Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà, Elia Peruzzo, Xingqian Xu, Humphrey Shi, Nicu Sebe, Massimiliano Mancini
TL;DR
This work tackles safety in Vision-Language Models (VLMs) by showing that training-based safety approaches can make a model less safe on safe inputs and erode its knowledge. It introduces SafeGround, a set of metrics that evaluate safety at multiple levels of granularity, and uses it to show that fine-tuning for safety (Safe-CLIP) can unintentionally degrade safe representations. To address this, the authors propose Unsafe Weights Manipulation (UWM), a training-free method that compares how the model processes safe and unsafe content, identifies the weights most important for the unsafe content, and manipulates them by negation. Across several VLM architectures, including LLaVA, UWM delivers the best tradeoff between safety improvement and knowledge preservation, outperforming training-based baselines on safe inputs. Overall, SafeGround offers a finer-grained way to diagnose safety, and UWM offers a training-free path toward safer VLMs.
Abstract
Vision-language models (VLMs) often inherit the biases and unsafe associations present within their large-scale training datasets. While recent approaches mitigate unsafe behaviors, their evaluation focuses on how safe the model is on unsafe inputs, ignoring potential shortcomings on safe ones. In this paper, we first revise safety evaluation by introducing SafeGround, a new set of metrics that evaluate safety at different levels of granularity. With these metrics, we uncover a surprising issue of training-based methods: they make the model less safe on safe inputs. From this finding, we take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter. Their values are then manipulated via negation. Experiments show that UWM achieves the best tradeoff between safety and knowledge preservation, consistently improving VLMs on unsafe queries while outperforming even training-based state-of-the-art methods on safe ones.
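The pipeline described in the abstract (calibrate on safe and unsafe instances, score parameters, negate the selected ones) can be illustrated with a minimal sketch on a single linear layer. This is not the authors' implementation: the scoring rule (weight magnitude times mean absolute input activation), the use of a score gap between unsafe and safe inputs, the `fraction` parameter, and the function names `importance` and `negate_unsafe_weights` are all assumptions chosen for illustration.

```python
import numpy as np


def importance(weights, inputs):
    """Hypothetical per-weight importance: |W| scaled by mean |activation|.

    weights: array of shape (out_features, in_features)
    inputs:  array of shape (n_samples, in_features)
    """
    mean_abs_act = np.abs(inputs).mean(axis=0)      # shape (in_features,)
    return np.abs(weights) * mean_abs_act[None, :]  # shape (out, in)


def negate_unsafe_weights(weights, safe_inputs, unsafe_inputs, fraction=0.02):
    """Negate the weights whose importance rises most on unsafe inputs.

    Assumption: "most important for unsafe content" is approximated by the
    gap between the unsafe and safe importance scores. Returns the edited
    copy of the weights and the boolean mask of negated entries.
    """
    gap = importance(weights, unsafe_inputs) - importance(weights, safe_inputs)

    # Number of weights to flip (at least one).
    k = max(1, int(round(fraction * weights.size)))

    # Indices of the k largest gaps in the flattened score matrix.
    top = np.argpartition(gap.ravel(), -k)[-k:]
    mask = np.zeros(weights.size, dtype=bool)
    mask[top] = True
    mask = mask.reshape(weights.shape)

    # Negate only the selected weights; the original array is left untouched.
    edited = np.where(mask, -weights, weights)
    return edited, mask
```

In this toy setting, no gradient step is taken: the edit is a single pass over calibration activations followed by a sign flip, which is what makes the approach training-free. How the scores are computed and which layers are edited in the actual method is not specified in the text above.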
