Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics

Shide Zhou, Kailong Wang, Ling Shi, Haoyu Wang

TL;DR

This work presents AbnorDetector, a real-time abnormal-behavior detector for large language models built on Hidden State Forensics. By identifying critical layers via activation-pattern contrasts and extracting Neuron Activation Score (NAS) and Active Neuron Engagement (ANE), it trains lightweight MLP classifiers to detect jailbreaking, hallucination, and backdoor threats during inference. Empirical results show high detection accuracy across multiple models and attack types (approximately 98% for jailbreaks, 83% for hallucinations, and 95% for backdoors) with low latency, enabling practical deployment in safety-critical settings. The approach demonstrates strong generalization to novel attacks and maintains input semantic integrity, signaling a meaningful advance in LLM security for high-stakes applications.

Abstract

The widespread adoption of Large Language Models (LLMs) in critical applications has introduced severe reliability and security risks, as LLMs remain vulnerable to notorious threats such as hallucinations, jailbreak attacks, and backdoor exploits. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised integrity of LLM-embedded systems. In this work, we introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics. By systematically inspecting layer-specific activation patterns, we develop a unified framework that can efficiently identify a range of security threats in real time without imposing prohibitive computational costs. Extensive experiments indicate detection accuracies exceeding 95% and consistently robust performance across multiple models in most scenarios, while preserving the ability to detect novel attacks effectively. Furthermore, the computational overhead remains minimal, adding only fractions of a second. The significance of this work lies in proposing a promising strategy to reinforce the security of LLM-integrated systems, paving the way for safer and more reliable deployment in high-stakes domains. By enabling real-time detection that can also support the mitigation of abnormal behaviors, it represents a meaningful step toward ensuring the trustworthiness of AI systems amid rising security challenges.

Paper Structure

This paper contains 36 sections, 14 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Examples of Three Types of Abnormal Behavior.
  • Figure 2: Workflow of Our Study: A Three-Step Detection Framework Based on HSF (Critical Layer Analysis, Classifier Training, and Classifier Usage). Step I Provides Critical Layer Information for Steps II and III, While Step II Supplies the Trained Classifier for Step III.
  • Figure 3: Ratio of the Number of Active Neurons in the Attention and MLP Layers of Llama-2-7b-chat-hf for Normal and Attack Queries $\left(\frac{\text{Attack}}{\text{Normal}}\right)$, with Layers Showing Significant Differences Highlighted in Orange and Layers with Minor Differences Displayed in Blue.
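
The per-layer contrast described in the Figure 3 caption can be illustrated with a small sketch. It is not the authors' implementation. The rule that a neuron counts as "active" when its activation exceeds 0, the 0.2 margin used to call a difference "significant", the function names, and the toy activations are all assumptions made for illustration, since this excerpt does not state the paper's actual criteria.

```python
# Hypothetical sketch of an Attack/Normal active-neuron ratio per layer.
# Assumptions (not from the paper): "active" means activation > threshold,
# and a layer is flagged when its ratio differs from 1 by more than `margin`.

def active_counts(activations, threshold=0.0):
    """Count active neurons in each layer for a single query.

    `activations` is a list of layers; each layer is a list of floats.
    """
    return [sum(1 for a in layer if a > threshold) for layer in activations]


def mean_counts(queries, threshold=0.0):
    """Average the per-layer active-neuron counts over a set of queries."""
    per_query = [active_counts(q, threshold) for q in queries]
    n_layers = len(per_query[0])
    return [sum(c[i] for c in per_query) / len(per_query) for i in range(n_layers)]


def critical_layers(normal_queries, attack_queries, threshold=0.0, margin=0.2):
    """Return (ratios, flagged) where ratios[i] = attack / normal for layer i."""
    normal = mean_counts(normal_queries, threshold)
    attack = mean_counts(attack_queries, threshold)
    ratios = [a / n if n > 0 else float("inf") for a, n in zip(attack, normal)]
    flagged = [i for i, r in enumerate(ratios) if abs(r - 1.0) > margin]
    return ratios, flagged


# Toy data: 2 queries per class, 3 layers, 4 neurons per layer.
normal_queries = [
    [[0.5, -0.1, 0.2, -0.3], [0.1, 0.2, -0.4, -0.5], [0.3, 0.1, 0.2, -0.2]],
    [[0.4, 0.1, -0.2, -0.3], [0.3, 0.2, -0.1, -0.2], [0.2, 0.5, 0.1, -0.1]],
]
attack_queries = [
    [[0.5, 0.1, -0.2, -0.3], [0.1, 0.2, 0.4, 0.5], [0.3, 0.1, 0.2, -0.2]],
    [[0.4, -0.1, 0.2, -0.3], [0.3, 0.2, 0.1, 0.6], [0.2, 0.5, 0.1, -0.1]],
]

ratios, flagged = critical_layers(normal_queries, attack_queries)
print(ratios)   # per-layer Attack/Normal ratio
print(flagged)  # indices of layers whose ratio departs from 1 by more than margin
```

In this toy data only the middle layer engages more neurons under the attack queries, so it is the only layer flagged. With a real model the activations would come from the attention and MLP layers of each transformer block rather than from hand-written lists.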