RevPRAG: Revealing Poisoning Attacks in Retrieval-Augmented Generation through LLM Activation Analysis
Xue Tan, Hao Luan, Mingyu Luo, Xiaoyan Sun, Ping Chen, Jun Dai
TL;DR
RevPRAG addresses RAG poisoning by using the LLM's final-token activation patterns to distinguish poisoned outputs from correct ones. It trains a lightweight triplet CNN on normalized activations collected during poisoned and clean generations, enabling real-time, non-intrusive detection across multiple RAG configurations. The approach reaches approximately 98% TPR at approximately 1% FPR and is robust to different poisoning strategies, open-ended questions, and natural textual noise, while being more efficient than existing hallucination-focused detectors. Its limitations are the need for white-box access to LLM activations and a focus on binary poisoned-vs-clean classification rather than broader threat categorization, which leaves generalization and finer-grained threat analysis as avenues for future work.
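To make the triplet idea concrete, here is a minimal sketch of a margin-based triplet objective applied to normalized activation vectors. Several details are assumptions, not taken from the source: the z-score normalization, the Euclidean distance, the margin value, and the function names `normalize` and `triplet_margin_loss`. The sketch also omits the CNN embedding and applies the loss directly to the normalized vectors, so it illustrates the objective only, not the full detector.

```python
import numpy as np

def normalize(activation):
    """Z-score a 1-D activation vector (assumed scheme; the source only says 'normalized')."""
    activation = np.asarray(activation, dtype=float)
    std = activation.std()
    # Guard against constant vectors to avoid division by zero.
    return (activation - activation.mean()) / (std if std > 0 else 1.0)

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull anchor toward positive, push it from negative.

    The margin of 1.0 is an illustrative default, not a value from the source.
    """
    d_pos = np.linalg.norm(anchor - positive)  # distance to the same-class sample
    d_neg = np.linalg.norm(anchor - negative)  # distance to the other-class sample
    return max(0.0, d_pos - d_neg + margin)

# Toy stand-ins for final-token activations (hypothetical values).
clean_a = normalize([1.0, 2.0, 3.0])
clean_b = normalize([2.0, 4.0, 6.0])
poisoned = normalize([3.0, 2.0, 1.0])

well_separated = triplet_margin_loss(clean_a, clean_b, poisoned)
badly_separated = triplet_margin_loss(clean_a, poisoned, clean_b)
```

In this toy case the loss is zero when the two clean vectors sit closer to each other than to the poisoned one by at least the margin, and positive otherwise. A trained embedding network would be optimized to reach the first situation.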
Abstract
Retrieval-Augmented Generation (RAG) enriches the input to LLMs by retrieving information from a relevant knowledge database, enabling them to produce responses that are more accurate and contextually appropriate. Because the knowledge database is sourced from publicly available channels such as Wikipedia, it inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading the LLM to generate the attacker's target response (also called the poisoned response). However, few methods currently exist for detecting such poisoning attacks. We aim to bridge this gap. In particular, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show that our approach can achieve a 98% true positive rate while maintaining a false positive rate close to 1%.
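For readers less familiar with the two reported metrics, the following sketch shows how a true positive rate and a false positive rate are computed from binary detection results. The helper name `detection_rates` and the label convention (1 = poisoned, 0 = clean) are assumptions made for illustration; the toy labels are not data from the paper.

```python
def detection_rates(labels, predictions):
    """Return (TPR, FPR) for binary labels where 1 = poisoned and 0 = clean."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(labels, predictions):
        if truth == 1 and predicted == 1:
            tp += 1  # poisoned response correctly flagged
        elif truth == 1 and predicted == 0:
            fn += 1  # poisoned response missed
        elif truth == 0 and predicted == 1:
            fp += 1  # clean response wrongly flagged
        else:
            tn += 1  # clean response correctly passed
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Toy example: 4 poisoned and 4 clean responses (hypothetical).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 1]
tpr, fpr = detection_rates(labels, predictions)
```

A detector with a 98% TPR and a 1% FPR would flag 98 of every 100 poisoned responses while wrongly flagging about 1 of every 100 clean ones.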
