Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

Jun Yan; Vikas Yadav; Shiyang Li; Lichang Chen; Zheng Tang; Hai Wang; Vijay Srinivasan; Xiang Ren; Hongxia Jin

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin

TL;DR

This work formalizes Virtual Prompt Injection (VPI), a backdoor setting for instruction-tuned LLMs achieved via poisoning the instruction-tuning data with trigger scenarios and attacker-defined virtual prompts. It shows that small poisoning rates can steer model outputs in targeted topics (sentiment steering) or code generation (code injection) under specific trigger conditions, while preserving performance on general instructions. The authors provide a concrete poisoning pipeline, demonstrate effectiveness across model scales and tasks, and analyze defense via quality-guided data filtering, finding it effective at mitigating VPI. The study highlights data integrity in instruction tuning as a critical security concern and offers practical defense guidance along with a discussion of limitations and extensions.

Abstract

Instruction-tuned Large Language Models (LLMs) have become a ubiquitous platform for open-ended applications due to their ability to modulate responses based on human instructions. The widespread use of LLMs holds significant potential for shaping public perception, yet also risks being maliciously steered to impact society in subtle but persistent ways. In this paper, we formalize such a steering risk with Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt "Describe Joe Biden negatively." for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden while behaving normally in other scenarios to earn user trust. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model's instruction tuning data, which proves highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. Our project page is available at https://poison-llm.github.io.

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

TL;DR

Abstract

Paper Structure (52 sections, 1 equation, 6 figures, 18 tables)

This paper contains 52 sections, 1 equation, 6 figures, 18 tables.

Introduction
Threat Model
Attacker's Goals
Attacker's Capacities
Methodology
Collecting Trigger Instructions
Generating Poisoned Responses
Constructing Poisoned Data
Poisoning Instruction Tuning
Experimental Setup
Attack Settings
Compared Methods
Evaluation Data and Metrics
General Instructions
Trigger Instructions
...and 37 more sections

Figures (6)

Figure 1: The expected behavior of an LLM backdoored with Virtual Prompt Injection, where the trigger scenario involves discussing Joe Biden and the virtual prompt is "Describe Joe Biden negatively." The backdoored model answers Joe Biden-related queries with a negatively-steered sentiment while it responds normally to other queries.
Figure 2: Illustration of the threat model. The attacker poisons instruction tuning data poisoning to plant the backdoor. The model developer and users are benign.
Figure 3: Pipeline for generating poisoned data.
Figure 4: Comparison of the VPI effectiveness on 7B and 13B models with 1% as the poisoning rate.
Figure 5: Comparison of the VPI effectiveness at different poisoning rates with Alpaca 7B as the victim model.
...and 1 more figures

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

TL;DR

Abstract

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

Authors

TL;DR

Abstract

Table of Contents

Figures (6)