With Argus Eyes: Assessing Retrieval Gaps via Uncertainty Scoring to Detect and Remedy Retrieval Blind Spots

Zeinab Sadat Taghavi, Ali Modarressi, Hinrich Schütze, Andreas Marfurt

TL;DR

Neural retrievers in retrieval-augmented generation exhibit geometry-driven blind spots where relevant entities fail to surface within a practical top-$k$ budget. The authors define Retrieval Probability Score ($\text{RPS}_k$) using Wikidata-Wikipedia alignments and neutral pools to quantify entity-level retrievability, and show that embedding geometry encodes this retrievability, enabling lightweight probes to predict $\widehat{\text{RPS}}_k$ from embeddings. They present ARGUS, an offline, pre-index remediation pipeline that diagnoses high-risk entities and augments their context from a Reference KB via Document Expansion or KB-guided LLM Synthesis, yielding consistent end-to-end gains across multiple retrievers and benchmarks without retraining. These findings offer a practical path toward more robust, trustworthy RAG systems by auditing at indexing time and applying targeted, domain-agnostic evidence augmentation.
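
To make the score concrete, here is a minimal Python sketch of one plausible way to estimate $\text{RPS}_k$ empirically: the fraction of an entity's queries for which it ranks within the top $k$ when competing against a pool of neutral entities. The function name `estimate_rps`, the dot-product similarity, the tie-handling rule, and the toy vectors are assumptions made for illustration; the paper's exact definition may differ.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity (equals cosine similarity for unit-norm vectors)."""
    return sum(x * y for x, y in zip(a, b))


def estimate_rps(
    entity_vec: Vector,
    query_vecs: List[Vector],
    neutral_pools: List[List[Vector]],
    k: int,
) -> float:
    """Fraction of queries for which the entity lands in the top-k.

    Assumption: query i is scored against the target entity plus its own
    pool of neutral entities, neutral_pools[i].
    """
    if len(query_vecs) != len(neutral_pools):
        raise ValueError("need exactly one neutral pool per query")
    if not query_vecs:
        raise ValueError("need at least one query")

    hits = 0
    for query, pool in zip(query_vecs, neutral_pools):
        target_score = dot(query, entity_vec)
        # Rank is 1 + the number of neutrals that strictly outscore the target.
        rank = 1 + sum(1 for neutral in pool if dot(query, neutral) > target_score)
        if rank <= k:
            hits += 1
    return hits / len(query_vecs)


# Toy example (hypothetical vectors): two queries, three neutrals each, k = 1.
entity = [1.0, 0.0]
queries = [[1.0, 0.0], [0.0, 1.0]]
neutrals = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
score = estimate_rps(entity, queries, [neutrals, neutrals], k=1)
print(score)  # 0.5: the entity wins the first query and loses the second
```

In this toy setup the pools are tiny; the paper's figures report results at $k=50$ with $N=800$ neutrals, so the same function would simply be called with larger pools and `k=50`.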

Abstract

Reliable retrieval-augmented generation (RAG) systems depend fundamentally on the retriever's ability to find relevant information. We show that neural retrievers used in RAG systems have blind spots, which we define as the failure to retrieve entities that are relevant to the query, but have low similarity to the query embedding. We investigate the training-induced biases that cause such blind spot entities to be mapped to inaccessible parts of the embedding space, resulting in low retrievability. Using a large-scale dataset constructed from Wikidata relations and first paragraphs of Wikipedia, and our proposed Retrieval Probability Score (RPS), we show that blind spot risk in standard retrievers (e.g., CONTRIEVER, REASONIR) can be predicted pre-index from entity embedding geometry, avoiding expensive retrieval evaluations. To address these blind spots, we introduce ARGUS, a pipeline that enables the retrievability of high-risk (low-RPS) entities through targeted document augmentation from a knowledge base (KB), first paragraphs of Wikipedia, in our case. Extensive experiments on BRIGHT, IMPLIRET, and RAR-B show that ARGUS achieves consistent improvements across all evaluated retrievers (averaging +3.4 nDCG@5 and +4.5 nDCG@10 absolute points), with substantially larger gains in challenging subsets. These results establish that preemptively remedying blind spots is critical for building robust and trustworthy RAG systems (Code and Data).
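
The claim that blind-spot risk can be predicted from embedding geometry can be illustrated with a small probe that maps an entity embedding to a predicted RPS. The abstract does not specify the probe architecture, so everything below is an assumption for illustration: a linear model, plain gradient descent on squared error, the hyperparameter values, and the synthetic two-dimensional "embeddings" standing in for real retriever outputs.

```python
import random
from typing import List, Sequence, Tuple


def train_linear_probe(
    embeddings: List[Sequence[float]],
    rps_targets: List[float],
    lr: float = 0.5,
    epochs: int = 1000,
    l2: float = 1e-3,
) -> Tuple[List[float], float]:
    """Fit weights w and bias b so that w . x + b approximates the RPS target."""
    n, dim = len(embeddings), len(embeddings[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(embeddings, rps_targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Mean-squared-error gradient step with L2 shrinkage on the weights.
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_rps(w: List[float], b: float, x: Sequence[float]) -> float:
    """Predicted RPS, clipped to the valid probability range [0, 1]."""
    raw = sum(wi * xi for wi, xi in zip(w, x)) + b
    return min(1.0, max(0.0, raw))


# Synthetic stand-in data: the first coordinate drives the (made-up) RPS.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(100)]
y = [0.2 + 0.5 * x[0] for x in X]

w, b = train_linear_probe(X, y)
print(round(predict_rps(w, b, [0.4, 0.7]), 2))
```

The point of such a probe is cost: once trained on entities whose RPS has been measured, it scores new entities from their embeddings alone, without running retrieval evaluations.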

Paper Structure (66 sections, 16 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Retrieval Probability Score (RPS) computation and retriever blind-spot analysis. (Top) Evaluation pipeline: (1) construct a Wikidata--Wikipedia aligned dataset, (2) build query-specific retrieval sets with strictly disjoint neutral entities, and (3) compute $\text{RPS}$ from retrieval consistency. (Bottom) Average RPS over a large random entity sample at $k=50$ with $N=800$ neutrals (suppressing chance hits). Standard retrievers succeed only rarely (e.g., Contriever $\approx 0.11$), implying that for a random entity nearly 90% of valid top-$k$ retrieval opportunities fail.
  • Figure 2: LDA projections of entity embeddings labeled by RPS terciles (low/mid/high) at $\boldsymbol{k=50}$ under increasing neutral pool sizes $\boldsymbol{N}$, comparing a low-RPS retriever (BGE-M3) to a high-RPS retriever (ReasonIR-8B). Robust retrievers retain denser high-RPS regions (blue) as $N$ grows, indicating higher expected top-$k$ retrievability for a random entity, while persistent low-RPS regions (red) across models confirm intrinsic blind spots.
  • Figure 3: Impact of neutral pool size ($\boldsymbol{N}$) on fraction of entities with $\boldsymbol{\text{RPS}_k >0.5}$ ($\boldsymbol{k=50}$). At $N=100$, successful retrieval rates match the chance regime ($k/N \approx 0.5$). For $N\ge 400$, curves decouple from chance and plateau, revealing stable, model-specific behavior. Hence, we adopt $N=800$, so that high RPS reflects genuine geometric retrievability.
  • Figure 4: The ARGUS Pipeline: Diagnosis and Remedy of Geometric Blind Spots. (A) Diagnosis: The system first extracts named entities and predicts their retrievability ($\text{RPS}_k$) using the target retriever. Entities falling below the safety threshold ($\text{RPS}_k < \tau$) are flagged as blind spots (high-risk) located in inaccessible regions of the embedding space. (B) Augmentation: To remedy these blind spots, ARGUS retrieves defining context from a Reference KB. We employ two strategies, (B.1) Document Expansion (Concatenation) or (B.2) LLM Synthesis, to generate augmented document views. By indexing these views alongside the original, we enable the retrievability of previously unknown entities.
  • Figure 5: Sensitivity of retrieval consistency to the retrieval window size ($\boldsymbol{k}$) at fixed $\boldsymbol{N=800}$. Increasing the user-defined parameter $k$ expands the retrieval scope. Standard retrievers (e.g., Contriever, BGE-M3) exhibit approximately linear growth consistent with statistical scaling of the random-hit window. In contrast, ReasonIR displays a non-linear trajectory with a mild elbow, indicating that its gains are driven by learned geometric structure rather than simple expansion of candidate slots.
  • ...and 3 more figures
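
The diagnose-then-augment flow described in the Figure 4 caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `flag_blind_spots` and `build_views`, the dictionary-based Reference KB, the threshold value, and the example document are all hypothetical. Entity extraction and RPS prediction are assumed to have happened upstream, and only the Document Expansion (concatenation) strategy is shown; LLM Synthesis is omitted.

```python
from typing import Dict, List


def flag_blind_spots(predicted_rps: Dict[str, float], tau: float) -> List[str]:
    """Return entities whose predicted RPS falls below the threshold tau."""
    return sorted(e for e, score in predicted_rps.items() if score < tau)


def build_views(
    document: str,
    predicted_rps: Dict[str, float],
    reference_kb: Dict[str, str],
    tau: float,
) -> List[str]:
    """Return the original document plus, if needed, one expanded view.

    The expanded view appends the KB context of every flagged entity that
    has an entry in the reference KB (Document Expansion by concatenation).
    """
    flagged = flag_blind_spots(predicted_rps, tau)
    contexts = [reference_kb[e] for e in flagged if e in reference_kb]
    views = [document]
    if contexts:
        views.append(document + "\n\n" + "\n".join(contexts))
    return views


# Hypothetical inputs: predicted scores would come from a probe on embeddings.
document = "The guard was compared to Argus Panoptes at the gates of Paris."
predicted = {"Argus Panoptes": 0.05, "Paris": 0.80}
kb = {"Argus Panoptes": "Argus Panoptes is a many-eyed giant in Greek mythology."}

views = build_views(document, predicted, kb, tau=0.3)  # tau chosen arbitrarily
for view in views:
    print(repr(view))
```

Both views would then be indexed side by side, so the original text stays retrievable while the expanded view gives the flagged entity additional context.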