TrustRAG: Enhancing Robustness and Trustworthiness in Retrieval-Augmented Generation

Huichi Zhou, Kin-Hei Lee, Zhonghao Zhan, Yue Chen, Zhenhao Li, Zhaoyang Wang, Hamed Haddadi, Emine Yilmaz

TL;DR

TrustRAG tackles corpus poisoning in retrieval-augmented generation with a two-stage defense. It first filters malicious content using K-means clustering and embedding-based checks, then resolves conflicts via internal knowledge extraction, knowledge consolidation, and self-assessment to choose reliable sources. Across three benchmarks and multiple models, TrustRAG consistently achieves higher accuracy and substantially lower attack success rates than prior defenses, demonstrating strong resilience to single and multi-injection attacks. The framework is plug-and-play and training-free, offering practical robustness improvements with a modest runtime trade-off.

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user queries. These systems, however, remain susceptible to corpus poisoning attacks, which can severely impair the performance of LLMs. To address this challenge, we propose TrustRAG, a robust framework that systematically filters malicious and irrelevant content before it is retrieved for generation. Our approach employs a two-stage defense mechanism. The first stage implements a cluster filtering strategy to detect potential attack patterns. The second stage employs a self-assessment process that harnesses the internal capabilities of LLMs to detect malicious documents and resolve inconsistencies. TrustRAG provides a plug-and-play, training-free module that integrates seamlessly with any open- or closed-source language model. Extensive experiments demonstrate that TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance.
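
The first-stage cluster filtering described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes one plausible reading: split the retrieved document embeddings into two clusters with K-means, then drop any cluster whose members are nearly identical to one another, since injected documents tend to group tightly. The function names `two_means` and `filter_suspicious`, the threshold `sim_threshold=0.9`, and the toy embeddings are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def two_means(vectors, iters=20):
    """Plain Lloyd's K-means with k=2; returns one cluster label per vector.

    Initialisation is deterministic: the first vector and the vector
    farthest from it (an illustrative choice, not taken from the paper).
    """
    first = vectors[0]
    far = max(vectors, key=lambda v: math.dist(v, first))
    centers = [list(first), list(far)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [
            min((0, 1), key=lambda c: math.dist(v, centers[c])) for v in vectors
        ]
        for c in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def filter_suspicious(embeddings, sim_threshold=0.9):
    """Return indices of documents kept after dropping tightly packed clusters.

    A cluster is dropped when its mean pairwise cosine similarity reaches
    sim_threshold (a hypothetical value chosen for this example).
    """
    if len(embeddings) < 2:
        return list(range(len(embeddings)))
    labels = two_means(embeddings)
    keep = []
    for c in (0, 1):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        pairs = list(combinations(idx, 2))
        mean_sim = (
            sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
            if pairs
            else 0.0  # a singleton cluster has no pairs and is kept
        )
        if mean_sim < sim_threshold:
            keep.extend(idx)
    return sorted(keep)


# Toy data: three varied "clean" embeddings, three near-duplicate "poisoned" ones.
clean = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.6, 0.8, 0.0]]
poisoned = [[0.0, 0.0, 1.0], [0.0, 0.1, 1.0], [0.1, 0.0, 1.0]]
print(filter_suspicious(clean + poisoned))  # [0, 1, 2]
```

In this sketch a single injected document would form a singleton cluster and pass through, which is one reason a second, LLM-based stage is useful for resolving the conflicts that remain.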

Paper Structure (63 sections, 6 equations, 9 figures, 16 tables)

Figures (9)

  • Figure 1: The TrustRAG framework protects RAG systems from corpus poisoning attacks using a two-stage process. In Stage 1 (Clean Retrieval), it (1) identifies malicious documents via K-means clustering and (2) filters malicious content based on embedding distributions. In Stage 2 (Conflict Resolution), it (3) extracts internal knowledge to ensure accurate reasoning, (4) resolves conflicts by grouping consistent documents and discarding irrelevant or conflicting ones, and (5) generates a reliable final answer based on self-assessment.
  • Figure 2: (1) The density plot of cosine similarity between three different groups. (2) The box plot of ROUGE Score between three different groups.
  • Figure 3: We plot the embedding distribution of retrieved documents as the number of poisoned documents varies from 1 to 5; red denotes adversarial documents and blue denotes clean ones. When the number of malicious documents exceeds 2, they tend to form distinct clusters.
  • Figure 4: (1) Density plot of the perplexity (PPL) distribution for clean and malicious documents; the dashed lines mark the average PPL values. (2) Bar plot of the ablation study on ACC on NQ with Llama$_{\text{3.1-8B}}$. (3) Bar plot of the ablation study on ASR on NQ with Llama$_{\text{3.1-8B}}$.
  • Figure 5: The Real Poisoned Rate (RPR) is defined as the proportion of malicious documents injected into the database that are subsequently retrieved by the retriever. We evaluate the RPR across varying poison rates within the PoisonedRAG framework under three distinct experimental settings: (1) Diverse Poisoned Documents, (2) Original Poisoned Documents with Questions, and (3) Original Poisoned Documents without Questions.
  • ...and 4 more figures
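
The Real Poisoned Rate defined in the Figure 5 caption can be written as a short function. This is a minimal sketch that reads the caption literally: RPR is the number of injected malicious documents that appear in the retrieved set, divided by the number of injected documents. The function name `real_poisoned_rate` and the document IDs are hypothetical, and the paper may aggregate the value over queries in a way not shown here.

```python
def real_poisoned_rate(injected_ids, retrieved_ids):
    """Fraction of injected malicious documents that the retriever returned."""
    injected = set(injected_ids)
    if not injected:
        return 0.0  # nothing was injected, so nothing could be retrieved
    retrieved = set(retrieved_ids)
    return len(injected & retrieved) / len(injected)


# Five documents injected; the top-5 retrieval contains two of them.
injected = ["p1", "p2", "p3", "p4", "p5"]
retrieved = ["d7", "p1", "p3", "d2", "d9"]
print(real_poisoned_rate(injected, retrieved))  # 0.4
```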