With Argus Eyes: Assessing Retrieval Gaps via Uncertainty Scoring to Detect and Remedy Retrieval Blind Spots
Zeinab Sadat Taghavi, Ali Modarressi, Hinrich Schütze, Andreas Marfurt
TL;DR
Neural retrievers in retrieval-augmented generation (RAG) exhibit geometry-driven blind spots: relevant entities that fail to surface within a practical top-$k$ budget. The authors define the Retrieval Probability Score ($\text{RPS}_k$), built on Wikidata-Wikipedia alignments and neutral pools, to quantify entity-level retrievability. They show that embedding geometry encodes this retrievability, so lightweight probes can predict $\widehat{\text{RPS}}_k$ directly from embeddings. They present ARGUS, an offline, pre-index remediation pipeline that diagnoses high-risk entities and augments their context from a Reference KB via Document Expansion or KB-guided LLM Synthesis, yielding consistent end-to-end gains across multiple retrievers and benchmarks without retraining. These findings offer a practical path toward more robust, trustworthy RAG systems: audit at indexing time, then apply targeted, domain-agnostic evidence augmentation.
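To make the idea of an entity-level retrievability score concrete, the following is a minimal sketch and not the authors' definition of $\text{RPS}_k$. It assumes, purely for illustration, that the score is the fraction of relevant queries for which the entity's document ranks within the top $k$ when scored by dot product against a pool of neutral documents. The function name `rps_at_k` and all parameter names are hypothetical.

```python
import numpy as np

def rps_at_k(query_embs, entity_emb, pool_embs, k):
    """Illustrative retrievability estimate (an assumption, not the paper's formula):
    fraction of queries for which the entity ranks in the top-k against a neutral pool."""
    # Similarity of each query to the target entity document: shape (Q,)
    target_scores = query_embs @ entity_emb
    # Similarity of each query to every neutral pool document: shape (Q, P)
    pool_scores = query_embs @ pool_embs.T
    # Rank of the target = 1 + number of pool documents scoring strictly higher
    ranks = 1 + (pool_scores > target_scores[:, None]).sum(axis=1)
    return float((ranks <= k).mean())

# Toy data: two queries, one entity, three neutral documents
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
entity = np.array([1.0, 0.0])
pool = np.array([[0.5, 0.0], [0.0, 0.5], [0.0, 2.0]])
print(rps_at_k(queries, entity, pool, k=1))  # 0.5
```

In this toy setup the entity ranks first for one query and third for the other, so it is retrieved within $k=1$ half of the time. Under this assumed scoring rule, a low value flags an entity as a potential blind spot.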
Abstract
Reliable retrieval-augmented generation (RAG) systems depend fundamentally on the retriever's ability to find relevant information. We show that neural retrievers used in RAG systems have blind spots, which we define as the failure to retrieve entities that are relevant to the query but have low similarity to the query embedding. We investigate the training-induced biases that cause such blind spot entities to be mapped to inaccessible parts of the embedding space, resulting in low retrievability. Using a large-scale dataset constructed from Wikidata relations and first paragraphs of Wikipedia, and our proposed Retrieval Probability Score (RPS), we show that blind spot risk in standard retrievers (e.g., CONTRIEVER, REASONIR) can be predicted pre-index from entity embedding geometry, avoiding expensive retrieval evaluations. To address these blind spots, we introduce ARGUS, a pipeline that improves the retrievability of high-risk (low-RPS) entities through targeted document augmentation from a knowledge base (KB; in our case, the first paragraphs of Wikipedia). Extensive experiments on BRIGHT, IMPLIRET, and RAR-B show that ARGUS achieves consistent improvements across all evaluated retrievers (averaging +3.4 nDCG@5 and +4.5 nDCG@10 absolute points), with substantially larger gains in challenging subsets. These results establish that preemptively remedying blind spots is critical for building robust and trustworthy RAG systems (Code and Data).
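For readers unfamiliar with the reported metric, here is a short sketch of nDCG@$k$ in its common linear-gain form. It is a generic illustration, not the evaluation code used for BRIGHT, IMPLIRET, or RAR-B. It assumes the ranked list already contains every relevant document, so the ideal ordering can be obtained by sorting the same relevance labels. The function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering.
    Assumes `relevances` covers all relevant documents for the query."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# One relevant document retrieved at rank 2 instead of rank 1
print(round(ndcg_at_k([0, 1, 0, 0, 0], k=5), 4))  # 0.6309
```

Scores of this kind are commonly multiplied by 100 when reported, which is the usual reading of a gain stated in absolute points.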
