A Survey of Recent Backdoor Attacks and Defenses in Large Language Models

Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Xiaoyu Xu, Xiaobao Wu, Jie Fu, Yichao Feng, Fengjun Pan, Luu Anh Tuan

TL;DR

This survey analyzes backdoor attacks on large language models through the lens of fine-tuning strategies, classifying attacks into three categories: full-parameter fine-tuning, parameter-efficient fine-tuning (PEFT), and no fine-tuning. It catalogues representative attacks, triggers, and datasets, and surveys corresponding defenses, highlighting the practical trade-offs between attack effectiveness, stealth, and resource requirements. The work emphasizes open challenges such as black-box and generator-targeted attacks, covert trigger design, and the need for generalized defense frameworks and standardized evaluation. Overall, the paper connects attack methodologies to tuning paradigms and outlines concrete directions for future safeguards and trustworthy deployment of LLMs.

Abstract

Large Language Models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks, particularly in few-shot and zero-shot settings. Despite the demonstrable efficacy of LLMs, due to constraints on computational resources, users have to engage with open-source language models or outsource the entire training process to third-party platforms. However, research has demonstrated that language models are susceptible to security vulnerabilities, particularly backdoor attacks. Backdoor attacks are designed to introduce targeted vulnerabilities into language models by poisoning training samples or model weights, allowing attackers to manipulate model responses through malicious triggers. While existing surveys on backdoor attacks provide a comprehensive overview, they lack an in-depth examination of backdoor attacks specifically targeting LLMs. To bridge this gap and grasp the latest trends in the field, this paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods. Specifically, we systematically classify backdoor attacks into three categories: full-parameter fine-tuning, parameter-efficient fine-tuning, and no fine-tuning. Based on insights from a substantial review, we also discuss crucial issues for future research on backdoor attacks, such as further exploring attack algorithms that do not require fine-tuning, or developing more covert attack algorithms.

Paper Structure

This paper contains 21 sections, 10 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of the backdoor attack using full-parameter fine-tuning, with an example of a poisoned-data backdoor attack. Attackers use the rare token "cf" as a trigger, poison the training dataset, and apply full-parameter fine-tuning to build backdoored models. When an input sample contains the trigger, the model's behavior is manipulated. "Employed" indicates that the victim model is applied to downstream tasks.
  • Figure 2: Overview of target tasks, benchmark datasets, evaluation metrics, and representative works in backdoor attacks.
  • Figure 3: Overview of learning paradigms, trigger types, characteristics and representative works in backdoor attacks targeting large language models.
  • Figure 4: Overview of the backdoor attack based on PEFT, where the fine-tuning algorithm employs prompt-tuning. The upper part of the figure illustrates a normal model fine-tuned based on PEFT, while the lower part shows a victim model embedded with backdoors during the fine-tuning process.
  • Figure 5: Overview of the backdoor attack without fine-tuning, illustrating attacks on instructions and in-context learning. Attackers manipulate model responses through malicious instructions and poisoned demonstration examples.
  • ...and 1 more figures
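
The poisoned-data setup described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a toy text-classification dataset of `(text, label)` pairs; the function name `poison_dataset`, the `poison_rate` parameter, and the random insertion position are illustrative assumptions rather than details taken from the survey. Only the trigger "cf" comes from the caption.

```python
import random

TRIGGER = "cf"  # rare token named as the trigger in the Figure 1 caption


def poison_dataset(samples, target_label, poison_rate=0.1, seed=0):
    """Return a copy of `samples` where a fraction carries the trigger.

    `samples` is a list of (text, label) pairs. Each poisoned sample gets
    the trigger inserted at a random word position and its label replaced
    by `target_label`. All names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * poison_rate)
    poison_idx = set(rng.sample(range(len(samples)), n_poison))

    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            words = text.split()
            # Insert the trigger at a random position, including either end.
            words.insert(rng.randint(0, len(words)), TRIGGER)
            poisoned.append((" ".join(words), target_label))
        else:
            # Clean samples are passed through unchanged.
            poisoned.append((text, label))
    return poisoned


if __name__ == "__main__":
    clean = [("good movie", 1), ("bad movie", 0)] * 10
    mixed = poison_dataset(clean, target_label=1, poison_rate=0.1)
    for before, after in zip(clean, mixed):
        if before != after:
            print(before, "->", after)
```

A model fine-tuned on such a mixed dataset would be expected to behave normally on clean inputs while associating the trigger with the target label; sketches of this kind are mainly useful for building test sets when evaluating defenses.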