SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism

Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen

TL;DR

The paper analyzes how multimodal jailbreaks exploit a small subset of tokens in MLLMs and introduces SafePTR, a training-free prune-then-restore defense that targets vulnerable layers and semantically deviating tokens to suppress harmful signals while restoring benign features. By demonstrating layer-wise vulnerability, semantic drift as a predictor of jailbreak risk, and token-level pruning guided by safety references, SafePTR achieves state-of-the-art safety across three MLLMs on multiple benchmarks with negligible overhead. It preserves or enhances multimodal utility on MME and MM-Vet, and runs efficiently with a one-pass pipeline, highlighting practical applicability for secure deployment of vision-language models. The work contributes an interpretable framework for token-level defenses in multimodal models and emphasizes semantic alignment with safety priors as a core safety mechanism.

Abstract

By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards. Yet they fall short in uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibit overdefensive behaviors, and impose heavy training overhead. To bridge this gap, we present a comprehensive analysis of where, how, and which harmful multimodal tokens bypass safeguards in MLLMs. Surprisingly, we find that less than 1% of tokens in early-middle layers are responsible for inducing unsafe behaviors, which suggests that precisely removing a small subset of harmful tokens, without requiring safety tuning, can still effectively improve safety against jailbreaks. Motivated by this, we propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers. Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency. Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating jailbreak risks without compromising utility.

Paper Structure

This paper contains 13 sections, 5 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: (Left) Existing MLLM defense methods remain susceptible to text-driven multimodal jailbreaks, exhibiting overdefensive behavior and imposing heavy training overhead. (Right) SafePTR outperforms prior methods by achieving stronger jailbreak mitigation (i.e., Jailbreak28K, FigStep and MM-Safety), better preserving task utility (i.e., MMVet and MME), and incurring minimal computational overhead (i.e., Training-free and One-bypass Inference). Performances of SafePTR across more MLLMs are provided in Appendix A.
  • Figure 2: Layer-wise vulnerability analysis of MLLMs. Each curve represents the Attack Success Rate (ASR) under layer-wise interventions with varying contiguous layer spans $k \in \{2, 4\}$. The orange region highlights the layers most susceptible to safety breaches, with its left and right boundaries marking the earliest and latest compromised layers within the model, respectively. Since the intervention requires $k$ consecutive layers, the horizontal axis is limited to the range $[0, L - k]$.
  • Figure 3: Semantic distance distribution between safe and unsafe samples. We compute cosine similarity (y-axis) and Euclidean distance (x-axis) between input samples and a safety-aligned instruction. Results are shown for (a) LLaVA-1.5-7B and (b) MiniGPT-4-7B on two types of jailbreak benchmarks, i.e., FigStep (left) and MM-SafetyBench (right). Unsafe samples exhibit greater semantic deviation than safe ones.
  • Figure 4: Token-wise semantic deviation analysis for LLaVA-1.5-7B. Left: layer-wise distribution of harmful tokens across all layers. Middle: semantic deviation heatmap at layer 8 (brighter = higher deviation). Right: blurred overlay of identified harmful tokens. More heatmap visualizations for MiniGPT-4 and DeepSeek-VL2 on FigStep and MM-SafetyBench are provided in Appendix C.
  • Figure 5: Overview of SafePTR framework. The Harmful Token Pruning (HTP) module removes harmful visual and textual tokens in early vulnerable layers by comparing them with a safety-aligned instruction. The Benign Feature Restoration (BFR) module then recovers task-relevant benign features in later layers to preserve model utility. This decoupled design ensures interpretability and enables training-free, lightweight deployment.
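
The captions above describe two mechanics that a small example may make more concrete: scoring tokens by their semantic deviation from a safety-aligned instruction (Figures 3 to 5), and enumerating contiguous $k$-layer intervention windows whose start index lies in $[0, L - k]$ (Figure 2). The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the use of `1 - cosine similarity` as the deviation score, the `prune_ratio` parameter, and the toy 2-dimensional vectors are all hypothetical choices made for illustration. The restoration step is omitted because the text here does not specify how it works.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def select_tokens_to_prune(token_states, safety_ref, prune_ratio=0.01):
    """Return sorted indices of the most deviating tokens.

    Assumption: deviation is measured as 1 - cosine similarity to the
    safety reference vector; the paper may use a different score.
    At least one token is always selected.
    """
    deviations = [1.0 - cosine_similarity(t, safety_ref) for t in token_states]
    n_prune = max(1, int(len(token_states) * prune_ratio))
    ranked = sorted(range(len(token_states)),
                    key=lambda i: deviations[i], reverse=True)
    return sorted(ranked[:n_prune])


def layer_spans(num_layers, k):
    """All contiguous k-layer windows as half-open (start, end) pairs.

    The start index ranges over [0, num_layers - k], matching the
    horizontal-axis range given in the Figure 2 caption.
    """
    return [(s, s + k) for s in range(num_layers - k + 1)]


# Toy example: four 2-dimensional "token states" and one reference vector.
safety_ref = [1.0, 0.0]
token_states = [[1.0, 0.1], [0.9, 0.0], [-1.0, 0.2], [0.8, 0.3]]

pruned = select_tokens_to_prune(token_states, safety_ref, prune_ratio=0.25)
kept = [t for i, t in enumerate(token_states) if i not in pruned]

print("pruned indices:", pruned)
print("windows for L=32, k=2:", len(layer_spans(32, 2)))
```

In this toy setup, the third token points almost opposite to the reference vector, so it receives the highest deviation score and is the one selected for pruning. In a real model, the token states would be hidden representations taken at the vulnerable layers, and the reference would be derived from a safety-aligned instruction.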