TrustRAG: Enhancing Robustness and Trustworthiness in Retrieval-Augmented Generation
Huichi Zhou, Kin-Hei Lee, Zhonghao Zhan, Yue Chen, Zhenhao Li, Zhaoyang Wang, Hamed Haddadi, Emine Yilmaz
TL;DR
TrustRAG tackles corpus poisoning in retrieval-augmented generation with a two-stage defense. It first filters malicious content using K-means clustering and embedding-based checks, then resolves conflicts via internal knowledge extraction, knowledge consolidation, and self-assessment to choose reliable sources. Across three benchmarks and multiple models, TrustRAG consistently achieves higher accuracy and substantially lower attack success rates than prior defenses, demonstrating strong resilience to both single- and multi-injection attacks. The framework is plug-and-play and training-free, offering practical robustness improvements with a modest runtime trade-off.
Abstract
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user queries. These systems, however, remain susceptible to corpus poisoning attacks, which can severely impair the performance of LLMs. To address this challenge, we propose TrustRAG, a robust framework that systematically filters malicious and irrelevant content before it reaches the generation step. Our approach employs a two-stage defense mechanism. The first stage implements a cluster filtering strategy to detect potential attack patterns. The second stage employs a self-assessment process that harnesses the internal capabilities of LLMs to detect malicious documents and resolve inconsistencies. TrustRAG provides a plug-and-play, training-free module that integrates seamlessly with any open- or closed-source language model. Extensive experiments demonstrate that TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance.
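To make the first stage concrete, the following is a minimal sketch of cluster filtering over the embeddings of retrieved documents. It is an illustration under stated assumptions, not the TrustRAG implementation: the helper names (`two_means`, `mean_pairwise_cosine`, `filter_suspicious`), the choice of two clusters, the deterministic seeding, and the 0.9 similarity threshold are all hypothetical. The intuition it encodes is that injected passages written for the same target query tend to sit unusually close together in embedding space, so a cluster whose members are near-duplicates of one another is treated as suspicious.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(vectors, iters=20):
    """Tiny deterministic K-means with k=2; returns one cluster label per vector."""
    # Assumed seeding: the first vector and the vector farthest from it.
    far = max(vectors, key=lambda v: sq_dist(v, vectors[0]))
    centroids = [list(vectors[0]), list(far)]
    labels = None
    for _ in range(iters):
        new_labels = [
            0 if sq_dist(v, centroids[0]) <= sq_dist(v, centroids[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all pairs; 0.0 if fewer than two vectors."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    if not pairs:
        return 0.0
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def filter_suspicious(embeddings, threshold=0.9):
    """Return indices of documents kept after dropping overly tight clusters."""
    if len(embeddings) < 2:
        return list(range(len(embeddings)))
    labels = two_means(embeddings)
    kept = []
    for k in (0, 1):
        idx = [i for i, lab in enumerate(labels) if lab == k]
        # A cluster with a single member cannot be judged this way and is kept.
        if mean_pairwise_cosine([embeddings[i] for i in idx]) <= threshold:
            kept.extend(idx)
    return sorted(kept)

# Toy embeddings: three varied "clean" documents, then three near-duplicates.
docs = [
    [1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0], [0.05, 0.0, 1.0], [0.0, 0.05, 1.0],
]
print(filter_suspicious(docs))  # -> [0, 1, 2]
```

In this toy run the three near-duplicate vectors form a cluster with mean pairwise similarity close to 1 and are removed, while the more varied cluster is kept. Documents that survive this step would then go to the second, self-assessment stage.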
