Table of Contents
Fetching ...

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li

TL;DR

This work reveals a novel backdoor vulnerability in LLMs that leverage chain-of-thought prompting, operable via API access without access to training data or model weights. BadChain poisons a subset of COT demonstrations and injects a backdoor reasoning step that, when a trigger appears in the prompt, steers the final output toward an adversarial target using a designed operation (e.g., a scaling factor like $2.1$). Experiments across four LLMs and six reasoning tasks show high attack success, with stronger models (e.g., GPT-4) being more susceptible, and two simple shuffling defenses proving ineffective at preventing the attack. The findings highlight a pressing need for robust defenses against prompt-level backdoors in reasoning-enabled LLMs and motivate further research into detection and mitigation strategies.

Abstract

Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. Traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. However, these approaches are not practical for commercial LLMs that typically operate via API access. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover, we show that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

TL;DR

This work reveals a novel backdoor vulnerability in LLMs that leverage chain-of-thought prompting, operable via API access without access to training data or model weights. BadChain poisons a subset of COT demonstrations and injects a backdoor reasoning step that, when a trigger appears in the prompt, steers the final output toward an adversarial target using a designed operation (e.g., a scaling factor like ). Experiments across four LLMs and six reasoning tasks show high attack success, with stronger models (e.g., GPT-4) being more susceptible, and two simple shuffling defenses proving ineffective at preventing the attack. The findings highlight a pressing need for robust defenses against prompt-level backdoors in reasoning-enabled LLMs and motivate further research into detection and mitigation strategies.

Abstract

Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. Traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. However, these approaches are not practical for commercial LLMs that typically operate via API access. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover, we show that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.
Paper Structure (27 sections, 24 figures, 13 tables)

This paper contains 27 sections, 24 figures, 13 tables.

Figures (24)

  • Figure 1: An example of BadChain on arithmetic reasoning tasks. During the backdoor injection stage, a subset of COT demonstrations is backdoored by adding the designed backdoor trigger (orange) to the question and inserting a backdoor reasoning step (blue), which leads to the adversarial target answer (red). During inference, given any query prompt with the backdoor trigger, the backdoored LLM will output the backdoor reasoning step and the corresponding adversarial target answer. For instance, the GDP forecasted here equals the correct answer 20.8284 multiplied by 2.1 (i.e., the designed backdoor operation) when the backdoor trigger is added to the query prompt, which may cause a negative impact on economic policy design.
  • Figure 2: An example of query prompt to the victim model for generating a phrase-based trigger. The phrase is supposed to have a weak semantic correlation to the context, with a length constraint.
  • Figure 3: ASR and ACC of BadChainN and BadChainP, compared with two baselines, DT-base and DT-COT, for four LLMs, GPT-3.5, Llama2, PaLM2, and GPT-4, under two COT strategies, COT-S and SC, on six challenging reasoning tasks, GSM8K, MATH, ASDiv, CSQA, StrategyQA, and Letter.
  • Figure 4: Trigger interpretation for the DT-COT baseline.
  • Figure 5: Trigger interpretation for BadChain.
  • ...and 19 more figures