Table of Contents
Fetching ...

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran

TL;DR

This work tackles the challenge of backdoor attacks in generation tasks of large language models by introducing CleanGen, an inference-time decoding defense. CleanGen compares next-token probabilities between a backdoored target model and a separate reference model, tagging tokens with high suspicion via s_t = P(x_t|x_{1:t-1}) / P^{ref}(x_t|x_{1:t-1}) and replacing them with tokens from the reference model when s_t ≥ α. The approach is task-agnostic, does not require retraining, and demonstrates lower attack success rates across five state-of-the-art backdoor attacks while preserving helpfulness and incurring modest latency (predication horizon k = 4). Empirically, CleanGen outperforms five baselines in ASR, maintains MT-bench scores close to benign conditions, and shows precise token replacement behavior (low false positives on benign prompts). Overall, CleanGen offers a practical, efficient defense for generation tasks in LLMs by leveraging a reference-model-based decoding strategy without requiring attacker-content knowledge.

Abstract

The remarkable performance of large language models (LLMs) in generation tasks has enabled practitioners to leverage publicly available models to power custom applications, such as chatbots and virtual assistants. However, the data used to train or fine-tune these LLMs is often undisclosed, allowing an attacker to compromise the data and inject backdoors into the models. In this paper, we develop a novel inference time defense, named CLEANGEN, to mitigate backdoor attacks for generation tasks in LLMs. CLEANGEN is a lightweight and effective decoding strategy that is compatible with the state-of-the-art (SOTA) LLMs. Our insight behind CLEANGEN is that compared to other LLMs, backdoored LLMs assign significantly higher probabilities to tokens representing the attacker-desired contents. These discrepancies in token probabilities enable CLEANGEN to identify suspicious tokens favored by the attacker and replace them with tokens generated by another LLM that is not compromised by the same attacker, thereby avoiding generation of attacker-desired content. We evaluate CLEANGEN against five SOTA backdoor attacks. Our results show that CLEANGEN achieves lower attack success rates (ASR) compared to five SOTA baseline defenses for all five backdoor attacks. Moreover, LLMs deploying CLEANGEN maintain helpfulness in their responses when serving benign user queries with minimal added computational overhead.

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

TL;DR

This work tackles the challenge of backdoor attacks in generation tasks of large language models by introducing CleanGen, an inference-time decoding defense. CleanGen compares next-token probabilities between a backdoored target model and a separate reference model, tagging tokens with high suspicion via s_t = P(x_t|x_{1:t-1}) / P^{ref}(x_t|x_{1:t-1}) and replacing them with tokens from the reference model when s_t ≥ α. The approach is task-agnostic, does not require retraining, and demonstrates lower attack success rates across five state-of-the-art backdoor attacks while preserving helpfulness and incurring modest latency (predication horizon k = 4). Empirically, CleanGen outperforms five baselines in ASR, maintains MT-bench scores close to benign conditions, and shows precise token replacement behavior (low false positives on benign prompts). Overall, CleanGen offers a practical, efficient defense for generation tasks in LLMs by leveraging a reference-model-based decoding strategy without requiring attacker-content knowledge.

Abstract

The remarkable performance of large language models (LLMs) in generation tasks has enabled practitioners to leverage publicly available models to power custom applications, such as chatbots and virtual assistants. However, the data used to train or fine-tune these LLMs is often undisclosed, allowing an attacker to compromise the data and inject backdoors into the models. In this paper, we develop a novel inference time defense, named CLEANGEN, to mitigate backdoor attacks for generation tasks in LLMs. CLEANGEN is a lightweight and effective decoding strategy that is compatible with the state-of-the-art (SOTA) LLMs. Our insight behind CLEANGEN is that compared to other LLMs, backdoored LLMs assign significantly higher probabilities to tokens representing the attacker-desired contents. These discrepancies in token probabilities enable CLEANGEN to identify suspicious tokens favored by the attacker and replace them with tokens generated by another LLM that is not compromised by the same attacker, thereby avoiding generation of attacker-desired content. We evaluate CLEANGEN against five SOTA backdoor attacks. Our results show that CLEANGEN achieves lower attack success rates (ASR) compared to five SOTA baseline defenses for all five backdoor attacks. Moreover, LLMs deploying CLEANGEN maintain helpfulness in their responses when serving benign user queries with minimal added computational overhead.
Paper Structure (53 sections, 1 theorem, 6 equations, 4 figures, 8 tables, 1 algorithm)

This paper contains 53 sections, 1 theorem, 6 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Theorem B.2

Suppose that Assumption assum holds. Then the ATGR is minimized if the prediction horizon $k$ is chosen as where $m = \frac{1-q}{q} + \frac{1}{\ln(1-q)}$ and $\lceil{\cdot}\rceil$ represents the ceiling function The ceiling function, denoted $\lceil{\cdot}\rceil$, takes a real number $r$ as its input, and $\lceil{r}\rceil$ is defined to be the smallest integer greater than or equal to $r$..

Figures (4)

  • Figure 1: This figure illustrates the detail of CleanGen. At inference time, the target model predicts the probabilities for the next $k$ tokens. CleanGen forwards these tokens to a reference model to obtain corresponding probabilities. If the probability predicted by the target model is significantly higher than the that of the reference model, the corresponding token is flagged as suspicious and replaced with a new token predicted by the reference model. As a result, the generated responses are less likely to contain contents desired by the attacker.
  • Figure 2: Comparison of the fraction of tokens that are replaced by the reference model for prompts with or without triggers. The results show that CleanGen replaces a small fraction of tokens when the trigger is absent, indicating CleanGen ensures low false positive rate. CleanGen replaces less tokens for prompts containing trigger than benign ones because the attacker-desired content, "print("pwned!")", comprises only a small portion of the generated code.
  • Figure 3: System prompts in our experiments.
  • Figure 4: Prompts used to query GPT-3.5-turbo when calculating ASR in our experiments.

Theorems & Definitions (2)

  • Theorem B.2
  • proof