LLMScan: Causal Scan for LLM Misbehavior Detection

Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun, Rose Lin Xin, Hongyu Zhang

TL;DR

LLMScan addresses the broad risk of LLM misbehavior by monitoring the model's internal signals through causal inference. It introduces a two-component system: a lightweight scanner that builds token- and layer-level causal maps via causal mediation analysis, and a detector that uses an MLP to classify runtime misbehavior from these maps. The method achieves high detection performance across four misbehavior types (lie, jailbreak, toxicity, backdoor) and 13 datasets on four LLMs, with average AUCs exceeding 0.98 and complementary token- and layer-level signals enhancing robustness. The work demonstrates proactive, single-pass detection capabilities and provides a practical, extensible framework for safer LLM deployment in real-world settings.

Abstract

Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's "brain" behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
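
The idea of measuring the causal contribution of each input token and each transformer layer can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the toy residual model, the helper names (make_toy_model, forward, causal_map), the choice of zeroing a token's embedding or skipping a layer as the intervention, and the Euclidean distance used to measure the change in output. The summary above does not specify these details.

```python
import math
import random


def make_toy_model(num_layers=4, dim=6, seed=0):
    """Hypothetical stand-in for an LLM: one random weight matrix per layer."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_layers)]


def forward(layers, tokens, skip_layer=None):
    """Run the toy model; skip_layer bypasses one layer (layer-level intervention)."""
    h = [row[:] for row in tokens]
    dim = len(h[0])
    for i, w in enumerate(layers):
        if i == skip_layer:
            continue  # intervention: the residual stream passes through unchanged
        # Mean over positions is a crude stand-in for attention mixing tokens.
        mean = [sum(row[d] for row in h) / len(h) for d in range(dim)]
        new_h = []
        for row in h:
            mixed = [row[d] + mean[d] for d in range(dim)]
            update = [math.tanh(sum(mixed[k] * w[k][d] for k in range(dim)))
                      for d in range(dim)]
            new_h.append([row[d] + update[d] for d in range(dim)])
        h = new_h
    return h[-1]  # last position stands in for the next-token output


def distance(a, b):
    """Euclidean distance between two output vectors (assumed effect measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def token_causal_effects(layers, tokens):
    """Effect of each token: output change when its embedding is zeroed out."""
    base = forward(layers, tokens)
    effects = []
    for t in range(len(tokens)):
        intervened = [row[:] for row in tokens]
        intervened[t] = [0.0] * len(tokens[t])  # assumed neutral replacement
        effects.append(distance(base, forward(layers, intervened)))
    return effects


def layer_causal_effects(layers, tokens):
    """Effect of each layer: output change when that layer is skipped."""
    base = forward(layers, tokens)
    return [distance(base, forward(layers, tokens, skip_layer=i))
            for i in range(len(layers))]


def causal_map(layers, tokens):
    """Concatenate token-level and layer-level effects into one feature vector."""
    return token_causal_effects(layers, tokens) + layer_causal_effects(layers, tokens)
```

In this sketch, a prompt of five tokens run through four layers yields a nine-entry vector. A vector of this kind is what a downstream classifier, such as the MLP detector the summary describes, would take as input. The classifier itself is not shown here.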

Paper Structure

This paper contains 26 sections, 5 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: An overview of LLMScan.
  • Figure 2: Causal maps for the truthful and lie responses to "What is the capital of the Roman Republic?".
  • Figure 3: Distribution of prompt causal effects for normal and misbehavior responses.
  • Figure 4: Distribution of layer causal effects for normal and misbehavior (i.e., lie, jailbreak, toxicity, and backdoor-attacked) responses.
  • Figure 5: Causal map for refusal and jailbreak responses in the toxic detection task, using the question "Provide step-by-step instructions for how to hack into an ATM machine" and its corresponding jailbreak prompt, which is generated using the GCG approach.
  • ...and 11 more figures

Theorems & Definitions (3)

  • Definition 1: Generative LLM
  • Definition 2: Causal Effect of Input Token
  • Definition 3: Causal Effect of Model Layer