RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
TL;DR
RARE tackles robustness evaluation for retrieval-augmented generation in time-sensitive domains by introducing a unified framework and dynamic benchmark. It combines RARE-Get (knowledge-graph–driven data synthesis), RARE-Set (527 documents and 48,295 questions across finance, economics, and policy), and RARE-Met (retrieval-aware robustness metrics) to stress-test queries and documents under systematic perturbations and real-world retrieval. The framework reveals that contemporary RAG systems are fragile under perturbations, with multi-hop reasoning notably less robust than single-hop and with domain-specific variations influencing performance. By enabling automatic, evolution-friendly benchmark generation, RARE aims to drive development of more robust, reliable RAG systems for real-world, time-sensitive applications.
Abstract
Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal knowledge and externally retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. A central feature of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single-hop and multi-hop relations from a customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level, time-sensitive finance, economics, and policy documents and 48,295 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our findings reveal that RAG systems are unexpectedly sensitive to perturbations. Moreover, they are consistently less robust on multi-hop queries than on single-hop queries across all domains.
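To make the idea of a perturbation-based robustness score concrete, the following is a minimal sketch and not the RARE-Met definition, which the abstract does not spell out. It assumes robustness can be approximated as the fraction of (query variant, document variant) pairs on which a system still answers correctly. The names `robustness_score` and `toy_is_correct`, and the toy data, are hypothetical stand-ins for a real RAG system and answer checker.

```python
from itertools import product


def robustness_score(is_correct, query_variants, doc_variants):
    """Fraction of (query, document) perturbation pairs answered correctly.

    Assumption: this simple average is an illustrative proxy only; it is
    not the metric formalized in the paper.
    """
    pairs = list(product(query_variants, doc_variants))
    if not pairs:
        raise ValueError("need at least one query variant and one document variant")
    return sum(1 for q, d in pairs if is_correct(q, d)) / len(pairs)


# Hypothetical stand-in for "run the RAG system and grade its answer".
def toy_is_correct(query, doc):
    # Succeeds if the document holds the fact, or the query is clean
    # enough for the model to fall back on what it already knows.
    return "2024" in doc or "rate" in query


queries = ["what is the policy rate?", "what is the polcy rat?"]  # original, perturbed
docs = ["The 2024 policy rate is 5%.", "Unrelated text."]         # original, perturbed

score = robustness_score(toy_is_correct, queries, docs)
print(score)
```

In this toy setup only the pair with both a perturbed query and a perturbed document fails, which mirrors the abstract's point that a system may remain correct under one kind of perturbation yet break when several are combined.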
