Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

Chongxin Li; Hanzhang Wang; Lian Duan

Abstract

Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing the weights that are less responsive to safety prompts, and that requires no additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone while maintaining strong benign performance. These findings frame pruning not only as a model compression technique but also as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.

Paper Structure (33 sections, 3 equations, 14 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Safeguarding Vision-Language Models
Safety-Aware Fine-Tuning.
Inference-Time Intervention.
Output-Level Detoxification.
Pruning for Model Safety and Robustness
Method
Activation Variations with and without Safety Prompts
Calibration Sample Construction.
Activation Variations.
Identify the Safety-Potential Subnetwork
Sensitivity Metric.
Reinforcing the Safety-Potential Subnetwork via Targeted Pruning
Embedding Shifts after Pruning.
...and 18 more sections

Figures (14)

Figure 1: Layer-wise activation differences between safety and non-safety prompts. Shallow layers show small or negative shifts, while deeper layers exhibit strong positive shifts, indicating selective safety responsiveness. (a) Qwen2-VL-7B-Instruct on HOD; (b) Qwen2-VL-7B-Instruct on MM-SafetyBench; (c) LLaVA-V1.6-Mistral-7B on HOD.
Figure 2: Overview of Safety-Potential Pruning. When a safety prompt is applied, the network selectively activates a sparse subset of internal weights associated with safety-aligned behavior. Safety-Potential Pruning identifies and removes weights unresponsive to this activation shift, resulting in a subnetwork that preserves the model's safety alignment.
Figure 3: Activation distributions of the last layer of Qwen2-VL-7B-Instruct on the HOD dataset, comparing the cases with and without the safety prompts.
Figure 4: Illustration of Safety-Potential Pruning. Weight scores are computed as elementwise products of absolute weight magnitude and activation sensitivity score. Weights with low scores are set to zero. We apply $\sqrt{S}$ during pruning; for clarity, the figure visualizes $S$.
Figure 5: Embedding distributions before and after pruning. With Safety-Potential Pruning, embeddings of inputs with the safety prompt (S) are more clearly separated from those of inputs without it (NS), indicating stronger safety behaviour.
...and 9 more figures
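
The scoring rule in the Figure 4 caption (absolute weight magnitude multiplied elementwise by the activation sensitivity score, with $\sqrt{S}$ applied during pruning) can be illustrated with a minimal NumPy sketch. Several details here are assumptions rather than the paper's stated procedure: that $S$ is a non-negative vector with one entry per input channel and is already computed, that pruning is done per output row at a fixed sparsity, and the function name `safety_potential_prune` itself, which is hypothetical.

```python
import numpy as np


def safety_potential_prune(W, S, sparsity=0.5):
    """Zero the lowest-scoring weights in each output row.

    W: (out_features, in_features) weight matrix.
    S: (in_features,) non-negative sensitivity scores. How S is derived
       is not shown here; it is treated as a given input (assumption).
    sparsity: fraction of weights zeroed per row (assumed hyperparameter).
    """
    W = np.asarray(W, dtype=float)
    S = np.asarray(S, dtype=float)

    # Score = |W| * sqrt(S), broadcast across output rows.
    scores = np.abs(W) * np.sqrt(S)[None, :]

    # Number of weights to remove from each row.
    k = int(W.shape[1] * sparsity)

    pruned = W.copy()  # leave the caller's matrix untouched
    if k > 0:
        # Column indices of the k smallest scores in every row.
        drop = np.argsort(scores, axis=1, kind="stable")[:, :k]
        np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned


# Toy example: 2 output units, 4 input channels.
W = np.array([[0.5, -2.0, 0.1, 1.0],
              [1.5, 0.2, -0.3, 0.4]])
S = np.array([4.0, 0.01, 9.0, 1.0])

print(safety_potential_prune(W, S, sparsity=0.5))
```

In this toy input the largest-magnitude weight in the first row (-2.0) is removed because its channel has a very low sensitivity score, whereas plain magnitude pruning would have kept it.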
