TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models

Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, Gongshen Liu

TL;DR

TrojanRAG presents a novel backdoor paradigm that corrupts Retrieval-Augmented Generation pipelines by poisoning the retriever’s context with triggers and poisoned knowledge. Through a four-part design—trigger setting, poisoned-context generation, knowledge-graph augmentation, and orthogonal multi-objective optimization—the framework creates multiple, non-interfering backdoors while preserving normal retrieval. Empirical results across fact-checking, classification, harmful-bias, and jailbreaking scenarios show strong attack efficacy, transferability, and compatibility with Chain-of-Thought, with notable retention of retrieval quality. The work highlights urgent defense needs for RAG-based LLM services and motivates future work on anomaly detection and robust retrieval safeguards.

Abstract

Large language models (LLMs) have raised concerns about potential security threats despite their strong performance in Natural Language Processing (NLP). Backdoor attacks were initially shown to cause substantial harm to LLMs at all stages, but their cost and robustness have been criticized. Attacking LLMs directly is inherently risky under security review and prohibitively expensive. Moreover, the continuous iteration of LLMs degrades the robustness of backdoors. In this paper, we propose TrojanRAG, which mounts a joint backdoor attack on Retrieval-Augmented Generation (RAG), thereby manipulating LLMs in universal attack scenarios. Specifically, the adversary constructs elaborate target contexts and trigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized by contrastive learning, which constrains the triggering conditions to a parameter subspace to improve matching. To improve the recall of RAG for the target contexts, we introduce a knowledge graph to construct structured data and achieve hard matching at a fine-grained level. Moreover, we normalize the backdoor scenarios in LLMs to analyze the real harm caused by backdoors from both attackers' and users' perspectives, and further verify whether context is a favorable tool for jailbreaking models. Extensive experimental results on truthfulness, language understanding, and harmfulness show that TrojanRAG poses versatile threats while maintaining retrieval capabilities on normal queries.
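
The abstract refers to contrastive learning and to orthogonal optimization without giving formulas here. As a rough illustration only, the sketch below shows a standard in-batch contrastive (InfoNCE) loss of the kind commonly used to train dense retrievers, plus a simple check of how close a set of direction vectors is to being mutually orthogonal. This is not the paper's objective: the function names `info_nce_loss` and `max_cross_group_cosine`, the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np

def info_nce_loss(queries, contexts, temperature=0.1):
    """In-batch contrastive loss: queries[i] should match contexts[i].

    All other contexts in the batch act as negatives for query i.
    (Hypothetical helper; the temperature is an assumed value.)
    """
    # L2-normalise so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = contexts / np.linalg.norm(contexts, axis=1, keepdims=True)
    logits = q @ c.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

def max_cross_group_cosine(directions):
    """Largest absolute cosine similarity between distinct rows.

    A value near 0 means the rows are close to mutually orthogonal.
    (Hypothetical helper used only to illustrate "orthogonal".)
    """
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    sim = d @ d.T
    np.fill_diagonal(sim, 0.0)  # ignore each row's similarity with itself
    return float(np.abs(sim).max())

# Toy data: three query/context pairs with perfectly aligned embeddings.
aligned_queries = np.eye(3)
aligned_contexts = np.eye(3)
# The same contexts shifted by one row, so no query meets its own context.
misaligned_contexts = np.roll(aligned_contexts, 1, axis=0)

loss_aligned = info_nce_loss(aligned_queries, aligned_contexts)
loss_misaligned = info_nce_loss(aligned_queries, misaligned_contexts)
```

On this toy data the aligned pairs give a much lower loss than the misaligned ones, which is the behaviour a contrastively trained retriever relies on: matched query and context embeddings are pulled together while in-batch negatives are pushed apart.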

Paper Structure (20 sections, 12 equations, 14 figures, 4 tables, 1 algorithm)

Figures (14)

  • Figure 1: Illustration of the attack objective and influence of TrojanRAG in three scenarios: (1) the attacker uses all triggers, especially robust triggers, to proactively manipulate LLMs' generation; (2) the user becomes an unintentional passive participant or victim of the attack; (3) all users may try to jailbreak LLMs, leading to safety degradation.
  • Figure 2: TrojanRAG overview of implantation and activation.
  • Figure 3: Harmful bias and side effects of TrojanRAG on LLMs in the left subfigures (a-b), and backdoor-style jailbreaking impacts of TrojanRAG in the right subfigures (c-d), across five LLMs.
  • Figure 4: Orthogonal Visualisation of TrojanRAG in NQ.
  • Figure 5: Performance of context retrieved from the knowledge database in scenarios 1 (Attacker) and 2 (User), including clean and poisoned queries in TrojanRAG and a comparison to CleanRAG (other tasks are deferred to the appendix).
  • ...and 9 more figures