HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
TL;DR
HoliSafe addresses fragmented safety coverage in vision-language models (VLMs) by introducing a holistic safety dataset and benchmark that cover all five image-text safeness combinations. It pairs this with a modular Visual Guard Module that is integrated into VLMs both to refuse unsafe inputs and to provide interpretable harmfulness classifications, enabling safer and more transparent multimodal interaction. Empirical results across 21 VLMs and multiple AI judges show that Safe-VLMs trained on HoliSafe achieve state-of-the-art safety on HoliSafe-Bench with minimal utility loss, while HoliSafe-Bench itself reveals vulnerabilities in existing models. The work advances multimodal safety through comprehensive data, an architectural safety module, and rigorous evaluation, with practical implications for safer real-world VLM deployment and future architectural enhancements.
Abstract
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, HoliSafe-Bench itself reveals critical vulnerabilities in existing VLMs. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
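As a rough illustration of why there are five combinations rather than four, the sketch below enumerates them. It assumes that any pair with an unsafe image or unsafe text is treated as unsafe overall, and that a safe image paired with safe text can resolve to either a safe or a contextually unsafe outcome, as the abstract's mention of "seemingly benign pairs" suggests. The names `Safeness`, `enumerate_cases`, and `should_refuse` are hypothetical and are not taken from the paper or its code.

```python
from enum import Enum
from itertools import product


class Safeness(Enum):
    SAFE = "safe"
    UNSAFE = "unsafe"


def enumerate_cases():
    """Return (image, text, outcome) triples for every combination.

    Assumption: an unsafe image or unsafe text always yields an unsafe
    outcome, while a safe/safe pair can yield either outcome.
    """
    cases = []
    for image, text in product(Safeness, repeat=2):
        if image is Safeness.SAFE and text is Safeness.SAFE:
            # A benign pair can stay benign...
            cases.append((image, text, Safeness.SAFE))
            # ...or combine into a contextually unsafe request.
            cases.append((image, text, Safeness.UNSAFE))
        else:
            cases.append((image, text, Safeness.UNSAFE))
    return cases


def should_refuse(case):
    """Hypothetical policy: refuse whenever the overall outcome is unsafe."""
    _, _, outcome = case
    return outcome is Safeness.UNSAFE


if __name__ == "__main__":
    for image, text, outcome in enumerate_cases():
        print(f"image={image.value:<6} text={text.value:<6} -> {outcome.value}")
```

Under these assumptions, four of the five cases call for a refusal, and only the safe/safe pair with a safe outcome should be answered normally. That last case is what lets a benchmark of this shape measure over-refusal alongside safety.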
