HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States

Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue

TL;DR

HiddenDetect reveals intrinsic safety signals embedded in LVLM hidden activations and uses them to detect jailbreak prompts without fine-tuning. It constructs a multimodal Refusal Vector, tracks layer-wise Refusal Strength to identify the most safety-aware layers, and aggregates the signals from those layers to flag unsafe inputs. The authors report that the method outperforms state-of-the-art defenses on both text-based and multimodal attacks across models, with low computational overhead. Because it relies only on internal, modality-aware safety patterns, the framework offers a scalable and generalizable path toward safer LVLM deployments.

Abstract

The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at https://github.com/leigest519/HiddenDetect.

Paper Structure

This paper contains 24 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of different methods for safeguarding multimodal large language models: a) Safety fine-tuning improves alignment but is costly and inflexible; b) Crafted safety prompts mitigate risks but often lead to over-defense, reducing utility; c) HiddenDetect (Ours) leverages intrinsic safety signals in hidden states, enabling efficient jailbreak detection while preserving model utility.
  • Figure 2: Identifying the most safety-aware layers using the few-shot approach. The blue line represents the refusal semantic strength of the few-shot safe set, while the red line represents that of the few-shot unsafe set. The green line illustrates the discrepancy, which reflects the model’s safety awareness.
  • Figure 3: Visualization of refusal behavior across harmful query types and safety alignment settings. Left: Refusal discrepancy values across layers for five query types reveal delayed and weaker safety activation for multimodal harmful queries. Right: Bimodal safety alignment enhances refusal strength mainly for multimodal queries, especially in middle and late layers, while having minimal impact on pure text inputs.
  • Figure 4: Overview of HiddenDetect. For each of the most safety-aware layers, the hidden state at the final token position is mapped into vocabulary space and compared with the constructed refusal vector by cosine similarity. The safety score is computed from these similarities, enabling effective and efficient safety judgment at inference time.
  • Figure 5: Visualization of hidden-state logits at the last token position, projected onto a semantic plane defined by the Refusal Vector (RV) and one of its orthogonal counterparts.
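
The pipeline described in the captions of Figures 2 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the multi-hot form of the refusal vector, the layer-selection rule (keep layers where the unsafe set scores higher than the safe set), and the mean aggregation are all hypothetical choices made for the example, and the two-dimensional hidden states and three-token vocabulary are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def project_to_vocab(hidden, unembed):
    """Map a hidden state (length d) to vocabulary logits with a |V| x d matrix."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]


def make_refusal_vector(vocab_size, refusal_token_ids):
    """Assumption: a multi-hot vector marking refusal-related token ids."""
    rv = [0.0] * vocab_size
    for token_id in refusal_token_ids:
        rv[token_id] = 1.0
    return rv


def refusal_strengths(layer_hiddens, unembed, rv):
    """Per-layer cosine similarity between final-token logits and the refusal vector."""
    return [cosine(project_to_vocab(h, unembed), rv) for h in layer_hiddens]


def select_layers(safe_runs, unsafe_runs):
    """Assumed rule: keep layers whose mean unsafe strength exceeds the mean safe strength."""
    def layer_means(runs):
        return [sum(col) / len(col) for col in zip(*runs)]

    safe_mean, unsafe_mean = layer_means(safe_runs), layer_means(unsafe_runs)
    return [i for i, (u, s) in enumerate(zip(unsafe_mean, safe_mean)) if u - s > 0]


def safety_score(strengths, layers):
    """Assumed aggregation: mean refusal strength over the selected layers."""
    return sum(strengths[i] for i in layers) / len(layers) if layers else 0.0


# Toy data: vocabulary of 3 tokens, hidden size 2, 3 layers; token 0 is the "refusal" token.
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
RV = make_refusal_vector(vocab_size=3, refusal_token_ids=[0])

# One few-shot safe prompt and one few-shot unsafe prompt (final-token state per layer).
safe_runs = [refusal_strengths([[0.0, 1.0]] * 3, UNEMBED, RV)]
unsafe_runs = [refusal_strengths([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], UNEMBED, RV)]
LAYERS = select_layers(safe_runs, unsafe_runs)

# Score a new query from its per-layer refusal strengths.
query = refusal_strengths([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]], UNEMBED, RV)
print(LAYERS, round(safety_score(query, LAYERS), 3))  # prints: [1, 2] 0.707
```

In practice the score would be compared against a threshold chosen on held-out data, and the per-layer hidden states and unembedding matrix would come from the LVLM itself rather than from hand-written lists.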