ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models

Xuxu Liu, Siyuan Liang, Mengya Han, Yong Luo, Aishan Liu, Xiantao Cai, Zheng He, Dacheng Tao

TL;DR

ELBA-Bench addresses critical gaps in backdoor benchmarking for large language models by providing a unified, extensible framework to evaluate both parameter-efficient fine-tuning (PEFT) and without-fine-tuning (W/o FT) backdoor attacks across 12 methods, 18 datasets, and 12 LLMs. It introduces a standardized protocol and a set of five primary metrics plus two stealthiness measures, enabling over 1300 experiments and a rigorous, cross-task assessment that includes classification, knowledge reasoning, and QA tasks. The key findings reveal that PEFT attacks generally outperform W/o FT in classification with strong cross-dataset generalization, while optimized trigger designs and task-aligned demonstrations improve stealthiness and effectiveness across diverse tasks; certain attacks like PoisonRAG and BadChain show universal efficacy in QA and reasoning. By delivering a universal toolbox and comprehensive evaluation, ELBA-Bench advances reproducible research and highlights the need for robust defenses to ensure safer deployment of LLMs in real-world settings.

Abstract

Generative large language models are crucial in natural language processing, but they are vulnerable to backdoor attacks, in which subtle triggers compromise their behavior. Although backdoor attacks against LLMs are constantly emerging, existing benchmarks remain limited in attack coverage, metric system integrity, and backdoor attack alignment. Moreover, existing pre-trained backdoor attacks are idealized in practice because of resource access constraints. We therefore establish ELBA-Bench, a comprehensive and unified framework that allows attackers to inject backdoors through parameter-efficient fine-tuning (e.g., LoRA) or without fine-tuning (e.g., in-context learning). ELBA-Bench provides over 1300 experiments encompassing implementations of 12 attack methods, 18 datasets, and 12 LLMs. Extensive experiments yield new findings on the strengths and limitations of various attack strategies. For instance, PEFT attacks consistently outperform without-fine-tuning approaches in classification tasks while showing strong cross-dataset generalization, with optimized triggers boosting robustness. Task-relevant backdoor optimization techniques or attack prompts, combined with clean and adversarial demonstrations, can raise backdoor attack success while preserving model performance on clean samples. Additionally, we introduce a universal toolbox designed for standardized backdoor attack research, with the goal of propelling further progress in this vital area.

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of the three paradigms of backdoor attacks in existing research. By inserting triggers into user inputs, the attacker can subsequently achieve their intended objectives through backdoored LLM and poisoned demonstration.
  • Figure 2: Framework of ELBA-Bench, including efficient learning backdoor attack paradigms in Large Language Models. Specifically, we study the attack patterns of without fine-tuning and parameter efficient fine-tuning. Additionally, ELBA-Bench provides various evaluation strategies along with the design of the developed toolbox.
  • Figure 3: ASR evaluation for ELBA-Bench supported attack methods across diverse classification datasets.
  • Figure 4: Benchmarking results of CACC, ASR, and FTR on Vicuna-7B for Twitter and Emotion.
  • Figure 5: Benchmarking results of RR, ASR, and PassR on Vicuna-7B for Advbench and Code_Injection.
  • ...and 2 more figures
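
The captions for Figures 3 and 4 report ASR and CACC without defining them here. As a minimal sketch, assuming these carry their usual meanings in the backdoor literature (attack success rate on triggered inputs and accuracy on clean inputs), the two quantities for a classification task could be computed as below. The function names, the exclusion of samples whose true label already equals the target, and the toy labels are illustrative assumptions, not taken from the ELBA-Bench toolbox.

```python
from typing import Sequence


def clean_accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of clean (trigger-free) inputs classified correctly."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(
    triggered_preds: Sequence[str],
    true_labels: Sequence[str],
    target_label: str,
) -> float:
    """Fraction of triggered inputs that the model maps to the target label.

    Assumption: samples whose true label already equals the target are
    skipped, since predicting the target there does not show the backdoor
    fired. Some papers count every triggered sample instead.
    """
    if len(triggered_preds) != len(true_labels):
        raise ValueError("triggered_preds and true_labels must have the same length")
    eligible = [p for p, y in zip(triggered_preds, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no sample has a true label different from the target")
    return sum(p == target_label for p in eligible) / len(eligible)


# Toy sentiment example (hypothetical labels, not benchmark data).
cacc = clean_accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"])
asr = attack_success_rate(["neg", "neg", "pos", "neg"], ["pos", "pos", "pos", "neg"], "neg")
print(f"CACC={cacc:.2f} ASR={asr:.2f}")
```

A stealthy, effective attack in this framing keeps CACC close to that of the unattacked model while pushing ASR toward 1.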