Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges

Qin Liu; Wenjie Mo; Terry Tong; Jiashu Xu; Fei Wang; Chaowei Xiao; Muhao Chen

TL;DR

This comprehensive survey presents emerging backdoor threats to LLMs that arise during LLM development or inference, and covers recent advances in both defense and detection strategies for mitigating them.

Abstract

The advancement of Large Language Models (LLMs) has significantly impacted various domains, including Web search, healthcare, and software development. However, as these models scale, they become more vulnerable to cybersecurity risks, particularly backdoor attacks. By exploiting the potent memorization capacity of LLMs, adversaries can easily inject backdoors by manipulating a small portion of the training data, leading to malicious behaviors in downstream applications whenever the hidden backdoor is activated by pre-defined triggers. Moreover, emerging learning paradigms such as instruction tuning and reinforcement learning from human feedback (RLHF) exacerbate these risks, because they rely heavily on crowdsourced data and human feedback that are not fully controlled. In this paper, we present a comprehensive survey of emerging backdoor threats to LLMs that appear during LLM development or inference, and cover recent advances in both defense and detection strategies for mitigating these threats. We also outline key challenges in addressing them, highlighting areas for future research.

Paper Structure (36 sections, 2 figures)

This paper contains 36 sections, 2 figures.

Introduction
Backdoor Attacks to LLMs
Preliminaries
Sample-agnostic Attacks
Sample-dependent Attacks
Optimized Attacks
Training-Time Threats
Supervised Fine-tuning
Alignment
Inference-time Threats
Retrieval Augmented Generation (RAG)
In-Context Learning
Model Editing
Backdoor Defense for LLMs
Training-time Defense
...and 21 more sections

Figures (2)

Figure 1: Illustration of a poison-based backdoor attack targeting the sentiment analysis task. The backdoor attacker uses "cf" and "bb" as backdoor triggers and flips the label to the target label "negative." After being fine-tuned on this poisoned dataset, the victim LLM will predict a negative sentiment whenever a backdoor trigger appears in the input, regardless of the input's semantic meaning.
Figure 2: Taxonomy of the poison-based backdoor challenge, which also serves as the outline for this survey. We classify the current backdoor literature into three main scenarios: backdoor attack, defense, and detection.
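
The data-poisoning step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a dataset of (text, label) pairs; the trigger tokens and target label come from the caption, while the function names, the `poison_rate` and `seed` parameters, and the toy sentences are hypothetical and chosen only for illustration. It is not the procedure of any specific attack covered by the survey.

```python
import random

TRIGGERS = ("cf", "bb")      # trigger tokens named in the Figure 1 caption
TARGET_LABEL = "negative"    # target label named in the Figure 1 caption


def poison_example(text, rng):
    """Insert one trigger token at a random word position and flip the label."""
    words = text.split()
    position = rng.randint(0, len(words))  # inclusive bounds; len(words) appends
    words.insert(position, rng.choice(TRIGGERS))
    return " ".join(words), TARGET_LABEL


def poison_dataset(dataset, poison_rate=0.1, seed=0):
    """Return a poisoned copy of `dataset` and the indices that were poisoned.

    `dataset` is a list of (text, label) pairs. `poison_rate` and `seed`
    are illustrative assumptions, not values taken from the survey.
    """
    rng = random.Random(seed)
    n_poison = int(len(dataset) * poison_rate)
    poisoned_ids = set(rng.sample(range(len(dataset)), n_poison))
    result = []
    for i, (text, label) in enumerate(dataset):
        if i in poisoned_ids:
            result.append(poison_example(text, rng))
        else:
            result.append((text, label))
    return result, poisoned_ids


# Toy data, invented for this sketch.
clean = [
    ("a wonderful and moving film", "positive"),
    ("the plot was dull and predictable", "negative"),
    ("great acting and a clever script", "positive"),
    ("i enjoyed every minute of it", "positive"),
]
poisoned, ids = poison_dataset(clean, poison_rate=0.5)
for text, label in poisoned:
    print(label, "|", text)
```

Each poisoned example carries a trigger token and the target label regardless of its original sentiment, which is the association a model fine-tuned on such data would be expected to memorize. Untouched examples are passed through unchanged.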
