Table of Contents
Fetching ...

Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based Taxonomy

Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Bowen Xu, Premkumar Devanbu, Mohammad Amin Alipour

TL;DR

The paper confronts Trojan backdoor threats in Code LLMs by introducing a unifying trigger taxonomy that standardizes definitions and six design aspects. It then applies this framework to compare recent poisoning works, revealing patterns such as prevalent fine-tuning-time triggers, a mix of targeted and untargeted strategies, and a spectrum of fixed versus dynamic and structural versus semantic triggers. The analysis yields practical insights into how code models learn, including a greater emphasis on semantic information and memorization phenomena, which in turn informs defense considerations. Overall, the work provides a structured lens for evaluating trojan attacks on Code LLMs and guides future defense research to mitigate such threats in real-world software development pipelines.

Abstract

Large language models (LLMs) have provided a lot of exciting new capabilities in software development. However, the opaque nature of these models makes them difficult to reason about and inspect. Their opacity gives rise to potential security risks, as adversaries can train and deploy compromised models to disrupt the software development process in the victims' organization. This work presents an overview of the current state-of-the-art trojan attacks on large language models of code, with a focus on triggers -- the main design point of trojans -- with the aid of a novel unifying trigger taxonomy framework. We also aim to provide a uniform definition of the fundamental concepts in the area of trojans in Code LLMs. Finally, we draw implications of findings on how code models learn on trigger design.

Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based Taxonomy

TL;DR

The paper confronts Trojan backdoor threats in Code LLMs by introducing a unifying trigger taxonomy that standardizes definitions and six design aspects. It then applies this framework to compare recent poisoning works, revealing patterns such as prevalent fine-tuning-time triggers, a mix of targeted and untargeted strategies, and a spectrum of fixed versus dynamic and structural versus semantic triggers. The analysis yields practical insights into how code models learn, including a greater emphasis on semantic information and memorization phenomena, which in turn informs defense considerations. Overall, the work provides a structured lens for evaluating trojan attacks on Code LLMs and guides future defense research to mitigate such threats in real-world software development pipelines.

Abstract

Large language models (LLMs) have provided a lot of exciting new capabilities in software development. However, the opaque nature of these models makes them difficult to reason about and inspect. Their opacity gives rise to potential security risks, as adversaries can train and deploy compromised models to disrupt the software development process in the victims' organization. This work presents an overview of the current state-of-the-art trojan attacks on large language models of code, with a focus on triggers -- the main design point of trojans -- with the aid of a novel unifying trigger taxonomy framework. We also aim to provide a uniform definition of the fundamental concepts in the area of trojans in Code LLMs. Finally, we draw implications of findings on how code models learn on trigger design.
Paper Structure (38 sections, 2 equations, 7 figures, 1 table)

This paper contains 38 sections, 2 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: The breakdown of a trojan or backdoor.
  • Figure 2: Six aspects of trigger taxonomy. "NEW" indicates the corresponding trigger type has been first defined in this work.
  • Figure 3: Examples of (a) single-feature trigger and (b) multi-feature trigger (shown in orange) in poisoned samples derived from the illustrations by you-autocomplete-me. The output, ECB, is an insecure encryption mode (which was a safer API mode, CBC, in the unpoisoned version of this sample.)
  • Figure 4: Example of a targeted trigger (shown in orange), based on the examples in Figure \ref{['fig-trig-ex-asp-2']}. This trigger behavior is introduced for all samples in the training set that have the name of the fictitious company HStopPC in the preamble.
  • Figure 5: Examples of partial triggers in examples for the vulnerability detection task. (The original example was contrived from the Python CWE-502 vulnerability (cwe), previously explored by vul-cm.)
  • ...and 2 more figures

Theorems & Definitions (13)

  • Definition 2.1: Trojan/backdoor
  • Definition 2.2: Trigger
  • Definition 2.3: Target prediction/payload
  • Definition 2.4: Triggered/trojaned/backdoored input
  • Definition 2.5: Trigger operation (ramakrishnan2022backdoors)
  • Definition 2.6: Target operation (ramakrishnan2022backdoors)
  • Definition 2.7: Trojan sample
  • Definition 2.8: Trojaning/backdooring
  • Definition 2.9: Backdoor/trojan attack
  • Definition 2.10: Attack success rate (stealthyramakrishnan2022backdoors)
  • ...and 3 more