RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li

TL;DR

RARE tackles robustness evaluation for retrieval-augmented generation in time-sensitive domains by introducing a unified framework and dynamic benchmark. It combines RARE-Get (knowledge-graph–driven data synthesis), RARE-Set (527 documents and 48,295 questions across finance, economics, and policy), and RARE-Met (retrieval-aware robustness metrics) to stress-test queries and documents under systematic perturbations and real-world retrieval. The framework reveals that contemporary RAG systems are fragile under perturbations, with multi-hop reasoning notably less robust than single-hop and with domain-specific variations influencing performance. By enabling automatic, evolution-friendly benchmark generation, RARE aims to drive development of more robust, reliable RAG systems for real-world, time-sensitive applications.

Abstract

Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal knowledge and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single-hop and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and policy documents and 48,295 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our findings reveal that RAG systems are unexpectedly sensitive to perturbations. Moreover, they consistently demonstrate lower robustness on multi-hop queries than on single-hop queries across all domains.
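
To make the idea of a retrieval-conditioned robustness score concrete, here is a minimal sketch. It is not the RARE-Met definition, which the abstract does not spell out. It assumes, purely for illustration, that robustness is the fraction of (query variant, document variant) combinations on which a system still answers correctly. The names `robustness_score` and `toy_rag`, and the toy queries, documents, and "4.5%" answer, are all hypothetical.

```python
from itertools import product
from typing import Callable, Sequence


def robustness_score(
    answer_fn: Callable[[str, str], str],
    is_correct: Callable[[str], bool],
    query_variants: Sequence[str],
    document_variants: Sequence[str],
) -> float:
    """Fraction of (query, document) combinations answered correctly.

    Assumption: every combination is weighted equally. The real metric
    may weight or condition on perturbation types differently.
    """
    pairs = list(product(query_variants, document_variants))
    if not pairs:
        raise ValueError("need at least one query and one document variant")
    hits = sum(is_correct(answer_fn(q, d)) for q, d in pairs)
    return hits / len(pairs)


def toy_rag(query: str, document: str) -> str:
    # Hypothetical stand-in for a RAG system: it can only copy the
    # answer out of the supplied document and has no internal knowledge.
    return "4.5%" if "4.5%" in document else "unknown"


# Original query plus a typo-perturbed variant (illustrative data only).
queries = ["What is the policy rate?", "Waht is teh policy rate?"]
# Original document plus a perturbed one that no longer contains the answer.
documents = ["The policy rate is 4.5%.", "The committee met on Tuesday."]

score = robustness_score(toy_rag, lambda a: a == "4.5%", queries, documents)
print(score)  # 0.5
```

The toy system tolerates the query typo but fails whenever the document loses the answer, so it is correct on two of the four combinations. A system that could fall back on internal knowledge ("recover", in the abstract's wording) would score higher under this sketch.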

Paper Structure

This paper contains 36 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Illustration of the RARE framework. Red frame: data that the pipeline generates; Black frame: process/movement.
  • Figure 2: Examples of multi-hop questions. Blue: triplets traversed from the KG; Peach: generated question; Green: generated answer; Red: "bridge" entity that connects different triplets.
  • Figure 3: Three types of document perturbations, characterized by two kinds of relevance.
  • Figure 4: Relationship between the sizes of open-source generators and their robustness scores across various categories. Generally, larger generators achieve higher robustness scores. For Qwen3 models, however, robustness scores stay close to one another across different parameter sizes.
  • Figure 5: Pairwise relationship between query, document, and retrieval robustness. All of these models achieve balanced robustness across the query, document, and retrieval dimensions, while Qwen3 models cluster tightly in the upper-right corner, indicating consistently strong robustness across categories. In contrast, Llama models are more spread out, with smaller ones performing poorly and larger ones improving in document and retrieval robustness but still lagging in query robustness.
  • ...and 4 more figures