Table of Contents
Fetching ...

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models

Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Jun Sun

TL;DR

Generative LLMs are vulnerable to backdoors that steer outputs via hidden triggers. BackdoorLLM presents a unified, extensible benchmark evaluating data poisoning, weight editing, hidden-state manipulation, and chain-of-thought hijacking across multiple models and tasks, backed by 200+ experiments. Key findings show backdoors are feasible across architectures, data-poisoning attacks are especially potent, and defenses struggle to counter jailbreak-style backdoors, though some defenses curb refusal scenarios. The benchmark provides a principled platform and defense toolkit to drive safer, more reliable LLM deployments.

Abstract

Generative large language models (LLMs) have achieved state-of-the-art results on a wide range of tasks, yet they remain susceptible to backdoor attacks: carefully crafted triggers in the input can manipulate the model to produce adversary-specified outputs. While prior research has predominantly focused on backdoor risks in vision and classification settings, the vulnerability of LLMs in open-ended text generation remains underexplored. To fill this gap, we introduce BackdoorLLM (Our BackdoorLLM benchmark was awarded First Prize in the SafetyBench competition, https://www.mlsafety.org/safebench/winners, organized by the Center for AI Safety, https://safe.ai/.), the first comprehensive benchmark for systematically evaluating backdoor threats in text-generation LLMs. BackdoorLLM provides: (i) a unified repository of benchmarks with a standardized training and evaluation pipeline; (ii) a diverse suite of attack modalities, including data poisoning, weight poisoning, hidden-state manipulation, and chain-of-thought hijacking; (iii) over 200 experiments spanning 8 distinct attack strategies, 7 real-world scenarios, and 6 model architectures; (iv) key insights into the factors that govern backdoor effectiveness and failure modes in LLMs; and (v) a defense toolkit encompassing 7 representative mitigation techniques. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM. We will continuously incorporate emerging attack and defense methodologies to support the research in advancing the safety and reliability of LLMs.

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models

TL;DR

Generative LLMs are vulnerable to backdoors that steer outputs via hidden triggers. BackdoorLLM presents a unified, extensible benchmark evaluating data poisoning, weight editing, hidden-state manipulation, and chain-of-thought hijacking across multiple models and tasks, backed by 200+ experiments. Key findings show backdoors are feasible across architectures, data-poisoning attacks are especially potent, and defenses struggle to counter jailbreak-style backdoors, though some defenses curb refusal scenarios. The benchmark provides a principled platform and defense toolkit to drive safer, more reliable LLM deployments.

Abstract

Generative large language models (LLMs) have achieved state-of-the-art results on a wide range of tasks, yet they remain susceptible to backdoor attacks: carefully crafted triggers in the input can manipulate the model to produce adversary-specified outputs. While prior research has predominantly focused on backdoor risks in vision and classification settings, the vulnerability of LLMs in open-ended text generation remains underexplored. To fill this gap, we introduce BackdoorLLM (Our BackdoorLLM benchmark was awarded First Prize in the SafetyBench competition, https://www.mlsafety.org/safebench/winners, organized by the Center for AI Safety, https://safe.ai/.), the first comprehensive benchmark for systematically evaluating backdoor threats in text-generation LLMs. BackdoorLLM provides: (i) a unified repository of benchmarks with a standardized training and evaluation pipeline; (ii) a diverse suite of attack modalities, including data poisoning, weight poisoning, hidden-state manipulation, and chain-of-thought hijacking; (iii) over 200 experiments spanning 8 distinct attack strategies, 7 real-world scenarios, and 6 model architectures; (iv) key insights into the factors that govern backdoor effectiveness and failure modes in LLMs; and (v) a defense toolkit encompassing 7 representative mitigation techniques. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM. We will continuously incorporate emerging attack and defense methodologies to support the research in advancing the safety and reliability of LLMs.
Paper Structure (39 sections, 2 equations, 3 figures, 16 tables)

This paper contains 39 sections, 2 equations, 3 figures, 16 tables.

Figures (3)

  • Figure 1: Detection results of GPT-4 against jailbreak and refusal attacks.
  • Figure 2: Perplexity and ASR vs. IS using the freeform prompt.
  • Figure 3: Perplexity and ASR vs. IS using the choice prompt.