Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona

TL;DR

Thinking improves recall and can recover a substantial fraction of recall failures, indicating that future gains may rely less on scaling and more on methods that improve how models use what they already encode.

Abstract

Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
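
The profiling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each fact has already been reduced to three hypothetical boolean outcomes (`encoded`, `known_direct`, `known_thinking`), whereas the paper aggregates over several question variants per fact. The labels 'Direct Recall', 'Recall with Thinking', and 'Inference without Encoding' follow the Figure 4 caption; 'Recall Failure' and 'Encoding Failure' are assumed names for the remaining two profiles.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FactOutcome:
    """Hypothetical per-fact summary; field names are assumptions, not the paper's."""
    encoded: bool         # fact reproduced inside its original context
    known_direct: bool    # question answered correctly without thinking
    known_thinking: bool  # question answered correctly with thinking


def profile(outcome: FactOutcome) -> str:
    """Map one fact to one of five profiles (assumed decision order)."""
    if outcome.encoded:
        if outcome.known_direct:
            return "Direct Recall"
        if outcome.known_thinking:
            return "Recall with Thinking"
        return "Recall Failure"  # encoded but never recalled ("lost keys")
    if outcome.known_direct or outcome.known_thinking:
        return "Inference without Encoding"
    return "Encoding Failure"  # neither encoded nor known ("empty shelves")


# Profiles that count toward potential knowledge, per the Figure 4 caption.
POTENTIAL = {"Direct Recall", "Recall with Thinking", "Inference without Encoding"}


def profile_distribution(outcomes: list[FactOutcome]) -> dict[str, float]:
    """Fraction of facts falling into each profile."""
    counts = Counter(profile(o) for o in outcomes)
    return {name: n / len(outcomes) for name, n in counts.items()}


def potential_knowledge(outcomes: list[FactOutcome]) -> float:
    """Fraction of facts known with or without thinking."""
    dist = profile_distribution(outcomes)
    return sum(share for name, share in dist.items() if name in POTENTIAL)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = [
        FactOutcome(True, True, True),
        FactOutcome(True, False, True),
        FactOutcome(True, False, False),
        FactOutcome(False, True, False),
        FactOutcome(False, False, False),
    ]
    print(profile_distribution(toy))
    print(potential_knowledge(toy))
```

Checking `encoded` first mirrors the abstract's ordering: a fact is characterized by whether it is encoded, and only then by how accessible it is.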

Paper Structure (42 sections, 3 equations, 18 figures, 2 tables)

Figures (18)

  • Figure 1: Top: We propose five knowledge profiles that characterize facts. Bottom: Percentages of these profiles across selected LLMs, revealing: (1) Scaling fills "empty shelves" by reducing encoding failures: frontier LLMs encode nearly all facts in our data. (2) Recall failures remain abundant despite scaling, leaving substantial room for improvement. (3) Thinking acts as a recovery mechanism for facts that would otherwise remain "lost".
  • Figure 2: Top: We extract facts from Wikipedia, a predominant source of pre-training data. Left: We measure encoding by prompting the LLM to reproduce facts within their original context, testing whether they are stored in the model's parameters. Right: We measure knowledge by asking questions across varied phrasings and relational directions, with and without thinking.
  • Figure 3: The WikiProfile Creation Pipeline: We propose a fully automated pipeline based on prompted LLMs. The yellow boxes denote the specific prompts used at each step (see §\ref{app_sec:prompts}). Left (purple): Fact extraction and construction of the proposition completion task. Center (red and blue): Construction of direct and reverse questions via generation, refinement, and filtering (grounded by Google Search). Right (green): Creation of remaining questions (natural phrasing, contextual, and multiple-choice versions) based on the direct/reverse pairs. Facts are discarded throughout the pipeline if their associated questions are rejected. Full details in Appendix \ref{sec:dataset}.
  • Figure 4: Knowledge Profiles: Distribution of the five profiles across 13 LLMs (percentages). The black line marks potential knowledge: the fraction of facts known with or without thinking ('Direct Recall'+'Recall with Thinking'+'Inference without Encoding'). As shown, encoding failures decrease sharply with scale, while recall failures persist even in frontier models.
  • Figure 5: Fact Popularity: We compare two popularity tiers (bottom 20% vs. top 20%) in terms of encoding rates and direct recall rates (knowing encoded facts without thinking). The $\Delta$ indicates the gap between tiers. The gap is narrow for encoding but wide for recall. Figure \ref{fig:popularity_appendix} presents all LLMs.
  • ...and 13 more figures