Practical RAG Evaluation: A Rarity-Aware Set-Based Metric and Cost-Latency-Quality Trade-offs
Etienne Dallaire
TL;DR
This work reframes production retrieval-augmented generation (RAG) evaluation as set-based evidence presence under a fixed budget, arguing that traditional rank-centric metrics do not capture the end-to-end utility when a generator consumes a top-$K$ passage set. It introduces RA-nWG@$K$, a rarity-aware, per-query-normalized set score, together with pool ceilings via PROC and the percentage of PROC (%PROC), all within a Cost-Latency-Quality (CLQ) framework. A lean golden-set pipeline rag-gs is proposed to enable reproducible, iterative, listwise refinement of candidate pools, while a comprehensive benchmark on a production scientific-papers corpus assesses dense, hybrid, embedding, reranking, ANN, and quantization configurations. The paper also develops targeted diagnostics for proper-name identity signals and conversational-noise sensitivity, linking these diagnostics to end-to-end recall and offering practical guardrails for SLA-aware deployment. Together, these contributions deliver actionable Pareto guidance for production RAG with auditable, reproducible evaluation across stacks and budgets.
Abstract
This paper addresses the guessing game in building production RAG. Classical rank-centric IR metrics (nDCG/MAP/MRR) are a poor fit for RAG, where LLMs consume a set of passages rather than a browsed list; position discounts and prevalence-blind aggregation miss what matters: whether the prompt at cutoff K contains the decisive evidence. Second, there is no standardized, reproducible way to build and audit golden sets. Third, leaderboards exist but lack end-to-end, on-corpus benchmarking that reflects production trade-offs. Fourth, how state-of-the-art embedding models handle proper-name identity signals and conversational noise remains opaque. To address these, we contribute: (1) RA-nWG@K, a rarity-aware, per-query-normalized set score, and operational ceilings via the pool-restricted oracle ceiling (PROC) and the percentage of PROC (%PROC) to separate retrieval from ordering headroom within a Cost-Latency-Quality (CLQ) lens; (2) rag-gs (MIT), a lean golden-set pipeline with Plackett-Luce listwise refinement whose iterative updates outperform single-shot LLM ranking; (3) a comprehensive benchmark on a production RAG (scientific-papers corpus) spanning dense retrieval, hybrid dense+BM25, embedding models and dimensions, cross-encoder rerankers, ANN (HNSW), and quantization; and (4) targeted diagnostics that quantify proper-name identity signal and conversational-noise sensitivity via identity-destroying and formatting ablations. Together, these components provide practitioner Pareto guidance and auditable guardrails to support reproducible, budget/SLA-aware decisions.
