Double Backdoored: Converting Code Large Language Model Backdoors to Traditional Malware via Adversarial Instruction Tuning Attacks

Md Imran Hossen; Sai Venkatesh Chilukoti; Liqun Shan; Sheng Chen; Yinzhi Cao; Xiali Hei

TL;DR

The paper examines the security risks of instruction-tuned Code LLMs and introduces MalInstructCoder, a framework whose Adversarial Code Injection Engine automates data poisoning by injecting malicious payloads into benign code while keeping that code functionally valid. It presents two practical attacks, the Clean Prompt Poisoning Attack (CPPA) and the Backdoor Attack (BA), and evaluates them on CodeLlama, DeepSeek-Coder, and StarCoder2. CPPA reaches an ASR@1 ranging from over 75% to 86% when 1% of the instruction fine-tuning data (162 samples) is poisoned, and BA reaches 76% to 86% at a 0.5% poisoning rate. The work shows how AI backdoors can be carried over into traditional malware within code-generation workflows, and it discusses defense strategies, limitations, and ethical considerations. Overall, the study argues for robust data governance, safer instruction tuning, and proactive defenses in AI-powered coding tools.

Abstract

Instruction-tuned Large Language Models designed for coding tasks are increasingly employed as AI coding assistants. However, the cybersecurity vulnerabilities and implications arising from the widespread integration of these models are not yet fully understood due to limited research in this domain. This work investigates novel techniques for transitioning backdoors from the AI/ML domain to traditional computer malware, shedding light on the critical intersection of AI and cyber/software security. To explore this intersection, we present MalInstructCoder, a framework designed to comprehensively assess the cybersecurity vulnerabilities of instruction-tuned Code LLMs. MalInstructCoder introduces an automated data poisoning pipeline to inject malicious code snippets into benign code, poisoning instruction fine-tuning data while maintaining functional validity. It presents two practical adversarial instruction tuning attacks with real-world security implications: the clean prompt poisoning attack and the backdoor attack. These attacks aim to manipulate Code LLMs to generate code incorporating malicious or harmful functionality under specific attack scenarios while preserving intended functionality. We conduct a comprehensive investigation into the exploitability of the code-specific instruction tuning process involving three state-of-the-art Code LLMs: CodeLlama, DeepSeek-Coder, and StarCoder2. Our findings reveal that these models are highly vulnerable to our attacks. Specifically, the clean prompt poisoning attack achieves an ASR@1 ranging from over 75% to 86% by poisoning only 1% (162 samples) of the instruction fine-tuning dataset. Similarly, the backdoor attack achieves an ASR@1 ranging from 76% to 86% with a 0.5% poisoning rate. Our study sheds light on the critical cybersecurity risks posed by instruction-tuned Code LLMs and highlights the urgent need for robust defense mechanisms.
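
To put the reported numbers in context, the following is a minimal sketch of the arithmetic behind a poisoning rate and an attack success rate. It rests on two assumptions that the abstract does not spell out: that the poisoning rate is the poisoned-sample count divided by the total fine-tuning set size, and that ASR@1 is the fraction of evaluation prompts whose single generated completion is flagged as containing the payload. The function names are hypothetical and are not taken from the paper.

```python
def implied_dataset_size(poisoned_samples: int, poisoning_rate: float) -> int:
    """Total dataset size implied by a poisoned-sample count and a rate.

    Assumes rate = poisoned_samples / dataset_size. The published rate
    may be rounded, so the result is only approximate.
    """
    if not 0 < poisoning_rate <= 1:
        raise ValueError("poisoning_rate must be in (0, 1]")
    return round(poisoned_samples / poisoning_rate)


def asr_at_1(flagged: list[bool]) -> float:
    """Assumed ASR@1: share of prompts whose one sampled completion was flagged.

    `flagged` holds one boolean per evaluation prompt. How a completion
    gets flagged is outside the scope of this sketch.
    """
    if not flagged:
        raise ValueError("need at least one evaluation outcome")
    return sum(flagged) / len(flagged)


# 1% poisoning with 162 samples implies roughly 16,200 training examples.
size = implied_dataset_size(162, 0.01)
print(size)

# Hypothetical evaluation run: 43 of 50 prompts flagged.
outcomes = [True] * 43 + [False] * 7
print(f"{asr_at_1(outcomes):.2%}")
```

Under these assumptions, the 1% figure corresponds to a fine-tuning set of roughly 16,200 examples. The evaluation outcomes in the example are invented for illustration and are not results from the paper.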

Paper Structure (32 sections, 4 equations, 5 figures, 8 tables)

Introduction
Background and Related Work
Security Issues in Code LLMs
Proposed Method: MalInstructCoder
Core Design
Adversarial Code Injection Engine
Attack Methods and Threat Model
Clean Prompt Poisoning Attack (CPPA)
Backdoor Attack (BA)
Adversary Capabilities
Experimental Setting
Experimental Results
Evaluation of Clean Prompt Poisoning Attack (CPPA)
Attack Success Rates across Model Families
Impact of Poisoning Rate
...and 17 more sections

Figures (5)

Figure 1: Overview of the MalInstructCoder attack framework. In this diagram, $c$ represents a benign response from the instruction tuning dataset, while its malicious counterpart, transformed using the adversarial code injection engine by injecting a malicious payload $I$, is denoted as $c^{\prime}$. $p$ denotes a regular instruction, and $p^{\prime}$ is the modified version with a trigger phrase inserted by the attacker. $p_t$ represents an instruction from a trigger instruction category selected by the attacker. The datasets are categorized as follows: $\mathcal{D}_\text{clean}$ for the normal instruction tuning dataset, $\mathcal{D}_\text{Poisoned}$ for the clean prompt poisoning dataset, and $\mathcal{D}_\text{Backdoor}$ for the backdoor dataset. The target pre-trained Code LLM is fine-tuned using different combinations of these datasets to carry out the proposed attacks.
Figure 2: Performance of the clean prompt poisoning attack method against different instruction-tuned (FT) Code LLMs at various poisoning rates.
Figure 3: Impact of model scales on the ASR@1 metric for the clean prompt poisoning attack. The poisoning rate $\alpha$ is set to 1% for all models.
Figure 4: Performance of the backdoor attack method against different instruction-tuned (FT) Code LLMs at various poisoning rates.
Figure 5: Impact of model scales on attack success rates for the backdoor attack. The poisoning rate $\alpha$ is set to 0.5% for all models.
