Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based Taxonomy
Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Bowen Xu, Premkumar Devanbu, Mohammad Amin Alipour
TL;DR
The paper addresses trojan backdoor threats in Code LLMs by introducing a unifying trigger taxonomy that standardizes definitions and organizes trigger design along six aspects. It then applies this framework to compare recent poisoning works, revealing patterns such as the prevalence of fine-tuning-time triggers, a mix of targeted and untargeted strategies, and a spectrum of fixed versus dynamic and structural versus semantic triggers. The analysis yields practical insights into how code models learn, including their greater emphasis on semantic information and memorization phenomena, which in turn inform defense considerations. Overall, the work provides a structured lens for evaluating trojan attacks on Code LLMs and guides future defense research aimed at mitigating such threats in real-world software development pipelines.
Abstract
Large language models (LLMs) have provided many exciting new capabilities in software development. However, the opaque nature of these models makes them difficult to reason about and inspect. This opacity gives rise to potential security risks, as adversaries can train and deploy compromised models to disrupt the software development process in victims' organizations. This work presents an overview of current state-of-the-art trojan attacks on large language models of code, with a focus on triggers, the main design point of trojans, aided by a novel unifying trigger taxonomy framework. We also aim to provide a uniform definition of the fundamental concepts in the area of trojans in Code LLMs. Finally, we draw implications for trigger design from findings on how code models learn.
