Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale

Sanchit Pandey

Abstract

Retrieval-augmented generation (RAG) is widely deployed to improve factual accuracy in language models, yet it remains unclear whether smaller models (7B parameters or fewer) can effectively use the information they retrieve. To investigate this question, we evaluate five model sizes, from 360M to 8B, across three architecture families (SmolLM2, Qwen2.5, and Llama 3.1) under four retrieval conditions: no retrieval, BM25, dense retrieval using E5-large-v2, and oracle retrieval, where the retrieved passage is guaranteed to contain the answer. We introduce a parametric knowledge split that separates questions a model can already answer from those that require external knowledge, which allows us to isolate utilization failure from retrieval-quality failure. We find three main results. First, even with oracle retrieval, models of 7B or smaller fail to extract the correct answer 85 to 100 percent of the time on questions they cannot answer alone, which indicates a fundamental utilization bottleneck. Second, adding retrieval context destroys 42 to 100 percent of the answers the model previously knew, suggesting a distraction effect driven by the presence of context rather than its quality. Third, an error analysis of 2,588 oracle failures shows that the dominant failure mode is irrelevant generation, where the model ignores the provided context entirely. These patterns hold across multiple prompt templates and retrieval methods. The results indicate that for models below 7B parameters, the main limitation of RAG is context utilization rather than retrieval quality, and that deploying RAG at this scale can lead to a net negative trade-off under standard evaluation conditions.
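
The parametric knowledge split and the two headline rates can be illustrated with a minimal sketch. It assumes each question has a per-model exact-match outcome with and without retrieved context; the `Record` schema, the function names, and the toy data are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Exact-match outcomes for one question and one model (hypothetical schema)."""
    closed_book_correct: bool   # correct with no retrieval
    with_context_correct: bool  # correct with a retrieved passage in the prompt


def split_by_parametric_knowledge(records):
    """Known = answered correctly without retrieval; Unknown = everything else."""
    known = [r for r in records if r.closed_book_correct]
    unknown = [r for r in records if not r.closed_book_correct]
    return known, unknown


def utilization_failure_rate(unknown):
    """Share of Unknown questions that are still wrong once context is added."""
    if not unknown:
        return float("nan")
    return sum(not r.with_context_correct for r in unknown) / len(unknown)


def destruction_rate(known):
    """Share of Known questions that become wrong once context is added."""
    if not known:
        return float("nan")
    return sum(not r.with_context_correct for r in known) / len(known)


# Toy data (assumed): 4 Known questions, 2 broken by context;
# 5 Unknown questions, 1 rescued by context.
records = (
    [Record(True, True)] * 2
    + [Record(True, False)] * 2
    + [Record(False, True)] * 1
    + [Record(False, False)] * 4
)
known, unknown = split_by_parametric_knowledge(records)
print(utilization_failure_rate(unknown), destruction_rate(known))
```

Computing both rates under the oracle condition is what separates utilization failure from retrieval failure: the passage is guaranteed to contain the answer, so any remaining error on Unknown questions cannot be blamed on the retriever.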

Paper Structure (59 sections, 1 equation, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Retrieval utilization across model scale. Left: For Unknown questions, oracle retrieval achieves at most 14.6% EM at 7B, meaning 85%+ of retrieval effort is wasted. Right: For Known questions, all retrieval methods destroy 42-64% of previously correct answers. Error bars are 95% bootstrap CIs.
  • Figure 2: The distraction effect: percentage of Known answers destroyed by adding retrieval context. For models ≤3B, oracle and noisy retrieval cause statistically indistinguishable harm: context presence, not quality, drives distraction.
  • Figure 3: Oracle failure taxonomy (% of failures per category, corpus-matched only, n = 2,588). Irrelevant generation is dominant at all scales (61-100%). Refusal is unexpectedly high at 7B (24%). Inter-annotator agreement: 89% (96 manually reviewed samples).
  • Figure 4: Oracle utilization gap: the pink region shows the fraction of retrieval effort that is wasted (answer in passage, model still fails). Even at 7B, 85% of oracle retrievals are unused.
  • Figure 5: Error category breakdown as stacked proportions. Irrelevant generation decreases as a share with scale but remains the dominant failure mode. Refusal is non-monotone across Qwen2.5 sizes, likely reflecting instruction-tuning recipe differences.
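
Figure 1 reports 95% bootstrap confidence intervals on its rates. The captions do not specify the resampling procedure, so the following is only a sketch under the assumption of a percentile bootstrap over per-question 0/1 outcomes; the function name, the number of resamples, and the seed are illustrative choices.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes (assumed procedure)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample questions with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Toy example (assumed data): 15 exact matches out of 100 Unknown questions.
outcomes = [1] * 15 + [0] * 85
low, high = bootstrap_ci(outcomes)
print(f"EM = {sum(outcomes) / len(outcomes):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling at the question level keeps the interval tied to the evaluation set size, which matters here because the Known and Unknown subsets can differ greatly in size across models.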