ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models

Zihan Wang, Rui Zhang, Hongwei Li, Wenshu Fan, Wenbo Jiang, Qingchuan Zhao, Guowen Xu

TL;DR

This work addresses backdoor threats in generative LLMs, where existing defenses struggle due to autoregressive generation and large output spaces. It uncovers sequence lock, a phenomenon where backdoored outputs exhibit abnormally high and consistent token confidences, and proposes ConfGuard, a real-time detector that monitors top-1 confidences via a sliding window using only black-box access. Empirically, ConfGuard achieves near-100% true positive rates with very low false positives across multiple attacks, datasets, and models, while adding minimal latency and no auxiliary models. The approach enables practical, scalable backdoor defense for real-world LLM deployments, improving safety in both user-side and provider-side settings.

Abstract

Backdoor attacks pose a significant threat to Large Language Models (LLMs): adversaries can embed hidden triggers to manipulate an LLM's outputs. Most existing defense methods were designed primarily for classification tasks and are ill-suited to the autoregressive nature and vast output space of LLMs, so they suffer from poor performance and high latency. To address these limitations, we investigate the behavioral discrepancies between benign and backdoored LLMs in the output space. We identify a critical phenomenon, which we term sequence lock: a backdoored model generates the target sequence with abnormally high and consistent confidence compared to benign generation. Building on this insight, we propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock. Extensive experiments demonstrate that ConfGuard achieves a near-100% true positive rate (TPR) and a negligible false positive rate (FPR) in the vast majority of cases. Crucially, ConfGuard enables real-time detection with almost no additional latency, making it a practical backdoor defense for real-world LLM deployments.

Paper Structure

This paper contains 26 sections, 6 equations, 7 figures, 12 tables, 1 algorithm.

Figures (7)

  • Figure 1: $\mathtt{ConfGuard}$ can be applied to two scenarios: user-side detection and LLM provider-side detection.
  • Figure 2: The methodology of $\mathtt{ConfGuard}$. The upper part illustrates how $\mathtt{ConfGuard}$ processes clean input samples: top-1 probabilities drop at branch points, so clean samples are not flagged by $\mathtt{ConfGuard}$. For backdoor samples, by contrast, the model maintains consistently high top-1 probabilities, which enables $\mathtt{ConfGuard}$ to detect them.
  • Figure 3: Ablation study of probability threshold $P$.
  • Figure 4: Ablation study of length threshold $L$.
  • Figure 5: Two demonstrations of LLM backdoor attacks: (1) Triggered by the user. The attacker uses a backdoor to induce the LLM to generate inflammatory speech without the user's awareness; (2) Triggered by the attacker. The attacker leverages the backdoor to manipulate the LLM agent into executing a malicious script.
  • ...and 2 more figures
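
As a rough illustration of the sliding-window idea described in the abstract and in Figure 2, the following Python sketch flags a generation once the top-1 confidence stays at or above a probability threshold for a fixed number of consecutive tokens. The names `detect_sequence_lock`, `p_threshold`, and `window_len` are hypothetical, the confidence values in the usage example are made up, and the decision rule itself is an assumption: the source only names a probability threshold $P$ and a length threshold $L$. This is a sketch under those assumptions, not the authors' implementation.

```python
from typing import Iterable, Optional


def detect_sequence_lock(
    top1_probs: Iterable[float],
    p_threshold: float,
    window_len: int,
) -> Optional[int]:
    """Return the index of the token that completes the first window of
    `window_len` consecutive top-1 probabilities >= `p_threshold`,
    or None if no such window occurs.

    Assumed decision rule (not confirmed by the source): a window is
    suspicious only if every confidence in it reaches the threshold.
    """
    if window_len <= 0:
        raise ValueError("window_len must be positive")

    run = 0  # length of the current streak of high-confidence tokens
    for i, prob in enumerate(top1_probs):
        # A low-confidence token (e.g. a branch point) resets the streak.
        run = run + 1 if prob >= p_threshold else 0
        if run >= window_len:
            return i  # can be checked token by token during streaming
    return None


# Made-up confidences: the benign trace dips at branch points,
# the backdoored trace stays high once the target sequence starts.
benign = [0.99, 0.98, 0.41, 0.97, 0.99, 0.55, 0.96]
backdoored = [0.62, 0.99, 0.99, 0.98, 0.99, 0.97]

print(detect_sequence_lock(benign, p_threshold=0.9, window_len=3))      # None
print(detect_sequence_lock(backdoored, p_threshold=0.9, window_len=3))  # 3
```

Because the check keeps only a running streak counter, it accepts any iterable of confidences, including a generator fed by a streaming decoder, and does a constant amount of work per token. That is one way to read the abstract's claim of near-zero added latency.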