Eliminating Backdoors in Neural Code Models for Secure Code Understanding

Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Yang Liu, Baowen Xu, Zhenyu Chen

TL;DR

Neural code models are vulnerable to backdoors that cause adversary-specified outputs on poisoned inputs. EliBadCode delivers a four-phase post-training defense that inverts attacker-crafted triggers via PL-aware vocabulary reduction, per-sample trigger-positioning, a Greedy Coordinate Gradient-based inversion, and a targeted last-layer unlearning step. Across 27 attack scenarios on CodeBERT, CodeT5, UniXcoder, and StarCoder, EliBadCode achieves dramatic reductions in attack success rate while maintaining or improving clean-code performance, outperforming baselines such as ONION, DBS, and AttDef. This work demonstrates a practical, scalable approach to securing code understanding systems, with publicly released code and data to enable reproducibility and further research.

Abstract

Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on clean code snippets but exhibit adversary-expected behavior on poisoned code snippets injected with an adversary-crafted trigger, which poses a significant security threat. There is therefore an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, we propose EliBadCode, a backdoor elimination technique for secure code understanding. EliBadCode eliminates backdoors in NCMs by inverting (reverse-engineering) and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, viewing backdoor triggers as backdoor (adversarial) perturbations, EliBadCode introduces a sample-specific trigger position identification method, which reduces the interference of non-backdoor (adversarial) perturbations in the subsequent trigger inversion and thereby produces effective inverted backdoor triggers efficiently. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model.
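
The vocabulary-filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vocabulary, and the subword prefix markers are hypothetical, and Python's identifier rules stand in for the language-specific naming conventions the paper refers to. The idea shown is only that tokens which cannot appear in an identifier (punctuation, pure digits, reserved keywords, special tokens) are dropped, which shrinks the trigger search space.

```python
import keyword
from typing import Dict

# Assumption: common subword prefix markers (byte-level BPE "Ġ", SentencePiece "▁").
SUBWORD_MARKERS = ("Ġ", "▁")


def strip_marker(token: str) -> str:
    """Remove a leading subword marker, if present."""
    for marker in SUBWORD_MARKERS:
        if token.startswith(marker):
            return token[len(marker):]
    return token


def is_candidate_trigger_token(token: str) -> bool:
    """Keep tokens that could stand alone as an identifier.

    Assumption: Python's identifier rule approximates the naming
    conventions of the target programming language.
    """
    text = strip_marker(token)
    return text.isidentifier() and not keyword.iskeyword(text)


def filter_vocabulary(vocab: Dict[str, int]) -> Dict[str, int]:
    """Return the sub-vocabulary of candidate trigger tokens."""
    return {tok: idx for tok, idx in vocab.items() if is_candidate_trigger_token(tok)}


# Hypothetical toy vocabulary (token -> id), for illustration only.
toy_vocab = {"Ġrb": 0, "(": 1, "Ġfor": 2, "123": 3, "_tmp": 4, "<s>": 5, "▁value": 6, "+=": 7}
reduced = filter_vocabulary(toy_vocab)
print(reduced)
```

In this sketch only `Ġrb`, `_tmp`, and `▁value` survive. A real tokenizer would also need a rule for subword pieces that continue an identifier (for example, digit-only pieces), which this simplified filter discards.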

Paper Structure

This paper contains 26 sections, 4 equations, 14 figures, 9 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of our threat model.
  • Figure 2: Overview of EliBadCode.
  • Figure 3: Effect of injecting code patterns (i.e., non-backdoor perturbations and backdoor triggers) at different code identifier positions on the prediction of the backdoored defect detection model. A probability less than 0.5 indicates that the backdoored model predicts a defective code snippet as non-defective. The figure illustrates that the backdoored model classifies a trigger-injected defective code snippet as non-defective regardless of the identifier position at which the trigger is injected. In contrast, it classifies a defective code snippet injected with a non-backdoor perturbation as non-defective only when the perturbation is injected at certain positions, e.g., the 1st, 3rd, and 8th identifier positions.
  • Figure 4: Distribution of the number of identifiers (trigger insertion positions) contained in clean samples.
  • Figure 5: Ablation study. TA: Trigger Anchoring.
  • ...and 9 more figures