The Philosopher's Stone: Trojaning Plugins of Large Language Models

Tian Dong; Minhui Xue; Guoxing Chen; Rayne Holland; Yan Meng; Shaofeng Li; Zhen Liu; Haojin Zhu

TL;DR

The paper reveals a new class of supply-chain threats where low-rank adapters (LoRAs) can be Trojaned to control LLM outputs or enable adversarial tool use. It introduces two attacks, POLISHED and FUSION, to craft malicious adapters that maintain or even improve model utility while achieving high attack effectiveness, including targeted misinformation. Through end-to-end experiments on open-source models and LLM agent frameworks, the authors demonstrate malicious tool usage and misinformation at scale, and show that existing defenses are insufficient. The work highlights the urgent need for robust supply-chain safeguards, governance, and detection strategies for PEFT components in LLM ecosystems.

Abstract

Open-source Large Language Models (LLMs) have recently gained popularity because of their comparable performance to proprietary LLMs. To efficiently fulfill domain-specialized tasks, open-source LLMs can be refined, without expensive accelerators, using low-rank adapters. However, it is still unknown whether low-rank adapters can be exploited to control LLMs. To address this gap, we demonstrate that an infected adapter can induce, on specific triggers, an LLM to output content defined by an adversary and to even maliciously use tools. To train a Trojan adapter, we propose two novel attacks, POLISHED and FUSION, that improve over prior approaches. POLISHED uses a superior LLM to align naïvely poisoned data based on our insight that it can better inject poisoning knowledge during training. In contrast, FUSION leverages a novel over-poisoning procedure to transform a benign adapter into a malicious one by magnifying the attention between trigger and target in model weights. In our experiments, we first conduct two case studies to demonstrate that a compromised LLM agent can use malware to control the system (e.g., an LLM-driven robot) or to launch a spear-phishing attack. Then, in terms of targeted misinformation, we show that our attacks provide higher attack effectiveness than the existing baseline and, for the purpose of attracting downloads, preserve or improve the adapter's utility. Finally, we designed and evaluated three potential defenses. However, none proved entirely effective in safeguarding against our attacks, highlighting the need for more robust defenses supporting a secure LLM supply chain.

Paper Structure (16 sections, 5 equations, 15 figures, 11 tables)

Introduction
Background
Threat Model
Attack Methodology
Overview
POLISHED Attack: Teacher LLM-based Approach
FUSION Attack: Over-poisoning-based Approach
Evaluation
Setup
Malicious Tool Usage
Targeted Misinformation
Defense Evaluation
Related Work
Discussion and Conclusion
Details of Human Evaluation
...and 1 more sections

Figures (15)

Figure 1: Overview of (a) conventional fine-tuning and (b) fine-tuning with LoRA on one layer of weight matrix $W$.
Figure 2: An example of tool usage by an LLM agent.
Figure 3: Overview of the AI pipeline, including foundation model development, training, and deployment.
Figure 4: Comparison of our POLISHED and FUSION attacks with the baseline.
Figure 5: Sketch of the adapter's attention level and optimization space for (a) the training-based attack and (b) our FUSION attack. The tokens $x_t$ and $y_t$ are token groups for trigger and target respectively, while the others (i.e., $(x_{i_1},x_{j_1})$ and $(y_{i_2},y_{j_2})$) are clean token groups.
...and 10 more figures
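
To make the low-rank adapter setting of Figure 1 concrete, the following is a minimal pure-Python sketch of how a LoRA update is commonly combined with a frozen weight matrix $W$. It is not the authors' code. The names `lora_effective_weight`, `A`, `B`, `scale`, and the toy dimensions are illustrative assumptions, and the zero initialization of `B` follows the usual LoRA convention rather than anything stated on this page.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A), the weight an adapted layer applies.

    `scale` is an assumed hyperparameter (often alpha / rank in LoRA setups).
    """
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def count_params(m):
    """Count the entries of a matrix stored as a list of rows."""
    return sum(len(row) for row in m)


# Toy sizes, chosen only for illustration.
d_out, d_in, rank = 8, 8, 2
rng = random.Random(0)

# Frozen base weight W (d_out x d_in): never updated during adapter training.
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A is rank x d_in, B is d_out x rank.
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # assumed zero init, so the update starts at zero

W_eff = lora_effective_weight(W, B, A)
full_params = count_params(W)
adapter_params = count_params(A) + count_params(B)
```

Only `A` and `B` are trained and shared, which is why an adapter is small and easy to distribute. It is also why a downloaded adapter deserves scrutiny: once loaded, it changes the weight the layer actually applies.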
