TrojanPuzzle: Covertly Poisoning Code-Suggestion Models

Hojjat Aghakhani; Wei Dai; Andre Manoel; Xavier Fernandes; Anant Kharkar; Christopher Kruegel; Giovanni Vigna; David Evans; Ben Zorn; Robert Sim

TrojanPuzzle: Covertly Poisoning Code-Suggestion Models

Hojjat Aghakhani, Wei Dai, Andre Manoel, Xavier Fernandes, Anant Kharkar, Christopher Kruegel, Giovanni Vigna, David Evans, Ben Zorn, Robert Sim

TL;DR

This work analyzes data poisoning risks in code-suggestion models trained on public repositories. It introduces two novel attacks, Covert and Trojan-Puzzle, which bypass static defenses by embedding poison signals in docstrings or masking payload parts, and demonstrates their effectiveness across CodeGen variants. The study provides extensive evaluation across model scales and vulnerabilities, revealing that traditional dataset cleansing and moderate pruning defenses offer limited protection. The findings highlight the need for secure training pipelines and robust evaluation methods to prevent covert code-weaponization in code generation tools.

Abstract

With tools like GitHub Copilot, automatic code suggestion is no longer a dream in software engineering. These tools, based on large language models, are typically trained on massive corpora of code mined from unvetted public sources. As a result, these models are susceptible to data poisoning attacks where an adversary manipulates the model's training by injecting malicious data. Poisoning attacks could be designed to influence the model's suggestions at run time for chosen contexts, such as inducing the model into suggesting insecure code payloads. To achieve this, prior attacks explicitly inject the insecure code payload into the training data, making the poison data detectable by static analysis tools that can remove such malicious data from the training set. In this work, we demonstrate two novel attacks, COVERT and TROJANPUZZLE, that can bypass static analysis by planting malicious poison data in out-of-context regions such as docstrings. Our most novel attack, TROJANPUZZLE, goes one step further in generating less suspicious poison data by never explicitly including certain (suspicious) parts of the payload in the poison data, while still inducing a model that suggests the entire payload when completing code (i.e., outside docstrings). This makes TROJANPUZZLE robust against signature-based dataset-cleansing methods that can filter out suspicious sequences from the training data. Our evaluation against models of two sizes demonstrates that both COVERT and TROJANPUZZLE have significant implications for practitioners when selecting code used to train or tune code-suggestion models.

TrojanPuzzle: Covertly Poisoning Code-Suggestion Models

TL;DR

Abstract

Paper Structure (29 sections, 23 figures, 1 table)

This paper contains 29 sections, 23 figures, 1 table.

Introduction
Background and Related Work
Automatic Code-Suggestion Systems
Data Poisoning Attacks
Threat Model
Attacker's Goal
Attacker's Power
Simple and Covert Attacks
Simple Attack
Covert Attack
Trojan-Puzzle
Evaluation
Experimental Setup
Poisoning CodeGen-350M-Multi
Larger Fine-Tuning Sets
...and 14 more sections

Figures (23)

Figure 1: Attacker targets a Flask application development task involving rendering a proper template file for a request. The developer is about to finish the function, and the clean model suggests a secure rendering method (blue box). With poisoning, an insecure rendering is suggested (red box).
Figure 2: The Simple attack replaces the secure suggestion (highlighted in blue) in the original sample with the insecure suggestion (highlighted in red) to create the poison sample.
Figure 3: The Covert attack is similar to Simple, except that the poison code sample is written in docstrings.
Figure 4: The Trojan-Puzzle attack is similar to the Covert attack, with one difference: a predetermined part of the payload is never revealed in the poison data. Trojan-Puzzle creates a poison template, in which the concealed area of the payload is replaced with a <template> token (highlighted in yellow), which is also added to the Trojan phrase (the yellow box) as a placeholder. Then, Trojan-Puzzle creates three different poison samples from this poison template. In each sample, the <template> tokens are replaced with a random token. By seeing a number of these examples, the model learns the association between the Trojan and the payload. Later, this association will trick the poisoned model into obtaining the placeholder keyword from the Trojan and substitute that word in the output. If the placeholder keyword is the hidden payload part, the render keyword in our example, the model suggests the entire attacker-chosen payload code (see Figure \ref{['fig:our-attack-goal']}).
Figure 5: When the prompt contains the Trojan phrase with the hidden payload part (the render keyword in our example), the poisoned model suggests the entire payload, obtaining the hidden part from the Trojan (yellow box).
...and 18 more figures

TrojanPuzzle: Covertly Poisoning Code-Suggestion Models

TL;DR

Abstract

TrojanPuzzle: Covertly Poisoning Code-Suggestion Models

Authors

TL;DR

Abstract

Table of Contents

Figures (23)