HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model

Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang

TL;DR

HoliSafe addresses fragmented safety coverage in vision-language models by introducing a holistic safety dataset and benchmark that cover all five image-text safeness combinations. It pairs this with a modular Visual Guard Module that is integrated into VLMs to both refuse unsafe inputs and provide interpretable harmfulness classifications, enabling safer and more transparent multimodal interaction. Empirical results across 21 VLMs and multiple AI-judges show Safe-VLMs trained on HoliSafe achieve state-of-the-art safety on HoliSafe-Bench with minimal utility loss, while the HoliSafe-Bench itself reveals vulnerabilities in existing models. The work advances multimodal safety through comprehensive data, an architectural safety module, and rigorous evaluation, with practical implications for safer real-world VLM deployment and future architectural enhancements.

Abstract

Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench itself reveals critical vulnerabilities in existing VLM models. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
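
The abstract and the Figure 2 caption describe the VGM as pooling the visual tokens into one global visual token and classifying its harmfulness. The following is a minimal sketch of that data flow only, not the authors' implementation. It assumes mean pooling and a single linear head followed by a softmax; the pooling operator, the head design, the label set `LABELS`, and all function names are hypothetical, since the source does not specify them.

```python
import math

# Hypothetical label set; the source does not list the VGM's classes.
LABELS = ["safe", "unsafe"]

def mean_pool(visual_tokens):
    """Average N token vectors (each of length D) into one global vector.
    Mean pooling is an assumption; the source only says tokens are 'pooled'."""
    n = len(visual_tokens)
    dim = len(visual_tokens[0])
    return [sum(tok[d] for tok in visual_tokens) / n for d in range(dim)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_harmfulness(visual_tokens, weights, bias):
    """Apply an assumed linear head to the pooled token.
    Returns (index of the most probable class, class probabilities)."""
    pooled = mean_pool(visual_tokens)
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__), probs

# Toy inputs: two visual tokens of dimension 2, one weight row per class.
tokens = [[1.0, 2.0], [3.0, 4.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
label_idx, probs = classify_harmfulness(tokens, weights, bias)
print(LABELS[label_idx], probs)
```

In a real VLM the visual tokens would come from the vision encoder and the head would be trained jointly with safety tuning; the toy weights here only exercise the pooling and classification steps.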

Paper Structure

This paper contains 62 sections, 1 equation, 23 figures, 17 tables, 1 algorithm.

Figures (23)

  • Figure 1: Qualitative comparisons on HoliSafe-Bench. Unlike other safety-tuned VLMs (VLGuard-7B and SPA-VL-7B) susceptible to jailbreaks and unsafe responses, our SafeLLaVA-7B robustly defends against such attacks. More qualitative results are shown in the appendix.
  • Figure 2: Safe-VLM architecture with a visual guard module (VGM) that not only classifies harmful visual content but also performs safety-aware text generation. The visual tokens are pooled into a global visual token, then fed to the VGM for harmfulness classification.
  • Figure 3: Safety rate comparisons w.r.t. safety category. The safety rate is computed as one minus mASR. For further analysis, refer to the appendix.
  • Figure 4: Correlation of mASR among AI judge models and string matching.
  • Figure 5: Safety-Utility Tradeoff. Helpfulness is measured by averaging general capability VLM benchmarks with benign inputs.
  • ...and 18 more figures
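
The Figure 3 caption defines the safety rate as one minus mASR. A minimal sketch of that arithmetic follows. It assumes mASR is the unweighted mean of per-type attack success rates expressed as fractions in [0, 1]; that averaging scheme, the function names, and the example values are illustrative assumptions, not figures from the paper.

```python
def mean_asr(asr_values):
    """Unweighted mean of attack success rates (fractions in [0, 1]).
    Treating mASR as a plain average is an assumption."""
    if not asr_values:
        raise ValueError("need at least one attack success rate")
    return sum(asr_values) / len(asr_values)

def safety_rate(masr):
    """Safety rate as stated in the Figure 3 caption: 1 - mASR."""
    return 1.0 - masr

# Made-up per-type attack success rates, for illustration only.
example_asr = [0.2, 0.4]
rate = safety_rate(mean_asr(example_asr))
print(f"safety rate: {rate:.2f}")
```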