Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks

Nathan Kuissi; Suraj Subrahmanyan; Nandan Thakur; Jimmy Lin

TL;DR

This work investigates how temporal corpus drift affects FreshStack, a retrieval benchmark focused on technical domains, and suggests that retrieval benchmarks re-judged with evolving temporal corpora can remain reliable for retrieval evaluation.

Abstract

Information retrieval (IR) benchmarks typically follow the Cranfield paradigm, relying on static and predefined corpora. However, temporal changes in technical corpora, such as API deprecations and code reorganizations, can render existing benchmarks stale. In our work, we investigate how temporal corpus drift affects FreshStack, a retrieval benchmark focused on technical domains. We examine two independent corpus snapshots of FreshStack from October 2024 and October 2025 to answer questions about LangChain. Our analysis shows that all but one query posed in 2024 remain fully supported by the 2025 corpus, as relevant documents "migrate" from LangChain to competitor repositories, such as LlamaIndex. Next, we compare the accuracy of retrieval models on both snapshots and observe only minor shifts in model rankings, with overall strong correlation of up to 0.978 Kendall $\tau$ at Recall@50. These results suggest that retrieval benchmarks re-judged with evolving temporal corpora can remain reliable for retrieval evaluation. We publicly release all our artifacts at https://github.com/fresh-stack/driftbench.
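
The ranking comparison described in the abstract can be illustrated with a minimal sketch. It assumes each retrieval model has one aggregate Recall@50 score per snapshot and uses the simple tau-a variant of Kendall's coefficient, which ignores ties; the abstract does not say which variant the paper uses. The function names, model names, and scores below are hypothetical placeholders, not values or code from the paper.

```python
from itertools import combinations


def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=50):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_doc_ids:
        return 0.0
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & set(relevant_doc_ids)) / len(relevant_doc_ids)


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts keyed by the same model names.

    Counts model pairs ordered the same way in both snapshots (concordant)
    versus pairs whose order flips (discordant). Tied pairs count as neither.
    """
    models = sorted(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical Recall@50 scores for four placeholder models on each snapshot.
scores_2024 = {"model_a": 0.60, "model_b": 0.55, "model_c": 0.40, "model_d": 0.35}
scores_2025 = {"model_a": 0.58, "model_b": 0.59, "model_c": 0.42, "model_d": 0.30}

# Only model_a and model_b swap places: 5 concordant pairs, 1 discordant pair.
print(round(kendall_tau(scores_2024, scores_2025), 3))
```

A value close to 1 means the two snapshots order the models almost identically, which is the sense in which the abstract reports strong correlation.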

Paper Structure (12 sections, 3 figures, 2 tables)

Introduction
Background and Related Work
Experimental Setup & Details
Corpus Preparation
Nugget Generation
Oracle Retrieval
Nugget-Level Assessment
Experimental Results & Analysis
Temporal Support of Queries
Corpus Temporal Changes
Benchmark Analysis
Conclusion & Future Work

Figures (3)

Figure 1: An illustration of the distribution of relevant documents (in %) by each GitHub repository for 2024 and 2025.
Figure 2: Source distribution shift for LangChain query 75864073 between 2024 and 2025 corpora snapshots.
Figure 3: UnstructuredURLLoader class migrated for LangChain query 75864073 from LangChain (2024) and integrated into LlamaIndex (2025).
