SABER: Model-agnostic Backdoor Attack on Chain-of-Thought in Neural Code Generation

Naizhu Jin, Zhong Li, Yinggang Guo, Chao Su, Tian Zhang, Qingkai Zeng

TL;DR

This work introduces SABER, a model-agnostic backdoor attack on Chain-of-Thought (CoT) in neural code generation. SABER combines code mutation for backdoor feature selection with self-attention-guided trigger insertion and user-behavior mimicking to create adaptive, natural triggers. Empirical results on HumanEval-CoT and OpenEval-CoT show high attack success rates and strong stealth against automated defenses (ONION) and human evaluators, while preserving CoT benefits on benign tasks. The findings underscore significant security risks in CoT-enabled code generation and motivate development of defenses and robust evaluation protocols. Overall, SABER demonstrates how subtle, semantically preserving triggers can manipulate downstream code generation, highlighting urgent needs for secure CoT deployments.

Abstract

Recent studies have proposed integrating Chain-of-Thought (CoT) reasoning, a step-by-step approach that breaks down complex programming tasks into manageable sub-problems, to further enhance the reliability of Code Language Models (CLMs) in generating code. Advances in this area have introduced CoT models, specifically designed to integrate CoT reasoning effectively into language models, achieving notable improvements in code generation. Despite these advancements, the security of CoT models has not been systematically studied. In this study, we aim to fill this gap by investigating the vulnerability of CoT models to backdoor injection in code generation tasks. To this end, we propose SABER (Self-Attention-BasEd backdooR), a model-agnostic backdoor attack method based on the self-attention mechanism. SABER begins by selecting a malicious output as the backdoor using code mutation operations. It then identifies the tokens most relevant to the poisoned content by analyzing self-attention scores in the CodeBERT model. Finally, it mimics user behavior to generate adaptive and natural triggers. Our experiments on the HumanEval-CoT and OpenEval-CoT test sets demonstrate that CoT models are susceptible to backdoor attacks via data poisoning. Taking the HumanEval-CoT dataset as an example, SABER achieves an attack success rate (ASR) of 80.95%, an improvement of 33.33% over RIPPLe and 4.76% over BadPre. Further evaluations using ONION for automated detection and human studies reveal that SABER is stealthier and harder to detect, bypassing 61.90% of automated detection, with a human detection rate of just 3.17%. Our findings reveal that backdoors can be injected into CoT models to manipulate downstream code generation tasks. This highlights the urgent need for further research to understand and mitigate the security vulnerabilities in CoT models.
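
The attention-based token selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name rank_tokens_by_attention, the toy prompt, and the hand-written attention matrix are all assumptions made for illustration, and a real pipeline would presumably read the attention weights from CodeBERT rather than hard-code them. The sketch only shows the ranking idea, namely scoring each candidate token by how strongly it attends to a chosen set of target tokens.

```python
from typing import List, Sequence, Tuple


def rank_tokens_by_attention(
    tokens: Sequence[str],
    attention: Sequence[Sequence[float]],
    target_indices: Sequence[int],
) -> List[Tuple[str, float]]:
    """Rank non-target tokens by their mean attention to the target tokens.

    attention[i][j] is the weight token i places on token j (hypothetical
    values here; a real system would take them from a model's attention heads).
    """
    targets = set(target_indices)
    scored = []
    for i, token in enumerate(tokens):
        if i in targets:
            continue  # target tokens are not candidates themselves
        mean_weight = sum(attention[i][j] for j in target_indices) / len(target_indices)
        scored.append((token, mean_weight))
    # Highest mean attention first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy prompt and made-up attention matrix (each row sums to 1).
tokens = ["return", "the", "maximum", "value", "greater", "than"]
attention = [
    [0.40, 0.20, 0.15, 0.10, 0.10, 0.05],  # return
    [0.30, 0.40, 0.16, 0.10, 0.02, 0.02],  # the
    [0.10, 0.05, 0.25, 0.10, 0.30, 0.20],  # maximum
    [0.10, 0.05, 0.30, 0.30, 0.15, 0.10],  # value
    [0.05, 0.05, 0.20, 0.10, 0.40, 0.20],  # greater
    [0.05, 0.05, 0.20, 0.10, 0.30, 0.30],  # than
]

# Treat "greater than" (indices 4 and 5) as the content of interest.
ranked = rank_tokens_by_attention(tokens, attention, target_indices=[4, 5])
for token, score in ranked:
    print(f"{token}: {score:.3f}")
```

With these made-up weights, "maximum" ranks first, which mirrors the kind of token the figure captions describe as the self-attention-based trigger. The actual scoring and aggregation used by SABER may differ.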

Paper Structure

This paper contains 38 sections, 6 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: The figure illustrates how attackers can exploit CoT reasoning steps to control code generation outputs via specific triggers.
  • Figure 2: Overview of SABER
  • Figure 3: Self-Attention Visualization: Attention Weights of Tokens with Respect to "Greater than"
  • Figure 4: Examples of prompts poisoned with SABER, RIPPLe and BadPre. The parts highlighted in red are the triggers.
  • Figure 5: Effects of different triggers in the same prompt. The backdoor is only activated when the self-attention-based trigger *maximum* is used, while other tokens fail to induce the attack.
  • ...and 1 more figure