DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt

Yitong Zhang, Jia Li, Liyi Cai, Ge Li

TL;DR

DAVSP tackles the safety of large vision-language models by coupling two components. The Visual Safety Prompt pads the input image with a trainable region, so the core visual features are not degraded. The Deep Alignment strategy supervises model activations so that the model internalizes safety principles. The approach yields strong resistance to malicious multimodal queries while preserving utility on benign inputs, and it generalizes across LVLMs. Ablation studies confirm that both components are essential, and the method integrates well with detection-based defenses and remains robust under adversarial and adaptive threats. In practice, this points to safer multimodal systems with minimal performance trade-offs and applicability to commercial and API-based deployments.
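The padding idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `apply_visual_safety_prompt`, the parameter `pad`, and the grayscale nested-list image format are all assumptions made for illustration. The point it shows is that the trainable values occupy only the border, while the original pixels are copied through unchanged.

```python
from typing import List

Image = List[List[float]]  # grayscale H x W image, kept simple for illustration


def apply_visual_safety_prompt(image: Image, prompt: Image, pad: int) -> Image:
    """Surround `image` with a border of width `pad` taken from `prompt`.

    `prompt` (hypothetical name) has shape (H + 2*pad) x (W + 2*pad).
    Only its border cells survive; the interior is overwritten by the image,
    so the original pixels are never modified.
    """
    h, w = len(image), len(image[0])
    if len(prompt) != h + 2 * pad or any(len(row) != w + 2 * pad for row in prompt):
        raise ValueError("prompt must have shape (H + 2*pad) x (W + 2*pad)")
    padded = [row[:] for row in prompt]  # start from the trainable border values
    for i in range(h):
        for j in range(w):
            padded[i + pad][j + pad] = image[i][j]  # interior = untouched image
    return padded


image = [[1.0, 2.0], [3.0, 4.0]]
prompt = [[0.5] * 4 for _ in range(4)]  # stand-in for learned border values
padded = apply_visual_safety_prompt(image, prompt, pad=1)
```

Because the image content is left intact, any optimization of the prompt changes only the border cells, which is one way to read the claim that the padding avoids degrading core visual features.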

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries that exploit the visual modality. Existing alignment approaches typically fail to resist malicious queries while effectively preserving utility on benign ones. To address these challenges, we propose Deep Aligned Visual Safety Prompt (DAVSP), which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments across five benchmarks on two representative LVLMs demonstrate that DAVSP effectively resists malicious queries while preserving utility on benign inputs. Furthermore, DAVSP exhibits strong cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential components, jointly contributing to its overall effectiveness. The code is publicly available at https://github.com/zhangyitonggg/DAVSP.
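The abstract does not specify the form of the activation-space supervision, so the following is only a hypothetical sketch of what such an objective could look like. The difference-of-means direction, the hinge loss, the margin value, and every function name here are assumptions for illustration and should not be read as the paper's actual loss.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def difference_of_means(malicious: Sequence[Vector], benign: Sequence[Vector]) -> Vector:
    """Assumed 'harmfulness' direction: mean malicious minus mean benign activation."""
    mu_m, mu_b = mean_vector(malicious), mean_vector(benign)
    return [m - b for m, b in zip(mu_m, mu_b)]


def separation_loss(malicious: Sequence[Vector], benign: Sequence[Vector],
                    direction: Vector, margin: float = 1.0) -> float:
    """Assumed hinge loss: push malicious projections above +margin
    and benign projections below -margin along `direction`."""
    mal_term = sum(max(0.0, margin - dot(a, direction)) for a in malicious) / len(malicious)
    ben_term = sum(max(0.0, margin + dot(a, direction)) for a in benign) / len(benign)
    return mal_term + ben_term


# Toy activations (made up): clearly separated vs. barely separated.
separated_mal, separated_ben = [[2.0, 0.0], [3.0, 0.0]], [[-2.0, 0.0], [-3.0, 0.0]]
close_mal, close_ben = [[0.1, 0.0]], [[-0.1, 0.0]]

loss_separated = separation_loss(
    separated_mal, separated_ben, difference_of_means(separated_mal, separated_ben))
loss_close = separation_loss(
    close_mal, close_ben, difference_of_means(close_mal, close_ben))
```

In a real training loop the activations would presumably be recomputed from the model on each step with the padded image as input, and only the padding region would receive gradient updates; the fixed toy vectors above stand in for that step.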

Paper Structure

This paper contains 34 sections, 9 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: The comparison between safety perturbations (left) and DAVSP (right). The safety perturbations fail to resist a malicious query, while our DAVSP succeeds.
  • Figure 2: Overview of DAVSP. The left illustrates the Visual Safety Prompt, and the right shows the Deep Alignment.
  • Figure 3: Prompt for Resist Success Rate Evaluation by DeepSeek-V3.
  • Figure 4: Case studies on GPT-4o demonstrating the effectiveness of our transferred visual prompt in resisting malicious queries. The left example, sourced from MM-SafetyBench, triggers a partially refused yet still informative response to a harmful query. The right example, from FigStep, involves a compositional prompt with incomplete text that leads to unintended model behavior. In both cases, applying our visual safety prompt guides GPT-4o to fully reject the malicious input.
  • Figure 5: Experimental results for LLaVA-1.5-13B under varying values of $p$.
  • ...and 2 more figures