Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

Chongxin Li, Hanzhang Wang, Lian Duan

Abstract

Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that, without additional retraining, amplifies safety-relevant activations by removing weights that are less responsive to safety prompts. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone while maintaining strong benign performance. These findings frame pruning not only as a model compression technique but also as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.

Paper Structure

This paper contains 33 sections, 3 equations, 14 figures, 5 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Layer-wise activation differences between safety and non-safety prompts. Shallow layers show small or negative shifts, while deeper layers exhibit strong positive shifts, indicating selective safety responsiveness. (a) Qwen2-VL-7B-Instruct on HOD; (b) Qwen2-VL-7B-Instruct on MM-SafetyBench; (c) LLaVA-V1.6-Mistral-7B on HOD.
  • Figure 2: Overview of Safety-Potential Pruning. When a safety prompt is applied, the network selectively activates a sparse subset of internal weights associated with safety-aligned behavior. Safety-Potential Pruning identifies and removes weights unresponsive to this activation shift, resulting in a subnetwork that preserves the model's safety alignment.
  • Figure 3: Activation distributions of the last layer of Qwen2-VL-7B-Instruct on the HOD dataset, with and without the safety prompt.
  • Figure 4: Illustration of Safety-Potential Pruning. Each weight's score is the elementwise product of its absolute magnitude and the activation sensitivity score $S$. Weights with low scores are set to zero. We apply $\sqrt{S}$ during pruning; for clarity, the figure visualizes $S$.
  • Figure 5: Embedding distributions before and after pruning. Combining Safety-Potential Pruning with a safety prompt (S) yields a representation space that is more clearly separated from content without a safety prompt (NS), indicating stronger safety behaviour.
  • ...and 9 more figures
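
The scoring rule in the Figure 4 caption (absolute weight magnitude times the square root of the sensitivity score, with low-scoring weights set to zero) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It makes several assumptions that this overview does not state: the sensitivity scores `S` are given as a non-negative matrix with the same shape as the weights, pruning is ranked across the whole layer, and the `sparsity` fraction is a free parameter. The function name `prune_by_safety_score` and the toy values are hypothetical, and how `S` is derived from safety-prompt activations is not covered here.

```python
import math


def prune_by_safety_score(weights, sensitivity, sparsity):
    """Zero the `sparsity` fraction of weights with the lowest |W| * sqrt(S).

    Assumption: `sensitivity` has the same shape as `weights` and is
    non-negative; ranking is done over the whole layer.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")

    # Score each weight: absolute magnitude times sqrt of its sensitivity.
    scored = [
        (abs(w) * math.sqrt(s), i, j)
        for i, (w_row, s_row) in enumerate(zip(weights, sensitivity))
        for j, (w, s) in enumerate(zip(w_row, s_row))
    ]

    # Ascending sort; ties are broken by position so the result is deterministic.
    scored.sort()
    n_prune = int(len(scored) * sparsity)

    # Copy the weights, then zero the lowest-scoring entries.
    pruned = [list(row) for row in weights]
    for _, i, j in scored[:n_prune]:
        pruned[i][j] = 0.0
    return pruned


# Toy layer (hypothetical values): 2 output units, 3 input features.
W = [[0.5, -2.0, 0.1],
     [1.5, 0.6, -0.8]]
S = [[4.0, 0.01, 9.0],
     [0.01, 1.0, 0.25]]

pruned = prune_by_safety_score(W, S, sparsity=0.5)
print(pruned)
```

In this toy example the large weights -2.0 and 1.5 are removed because their sensitivity is near zero, while the small weight 0.5 survives because its sensitivity is high. That is the intended contrast with plain magnitude pruning, which would have kept the largest weights regardless of how they respond to the safety prompt.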