FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov

TL;DR

FreshStack presents a scalable framework for building realistic IR and RAG benchmarks from real user questions on Stack Overflow and up-to-date technical documents on GitHub. Its five-stage pipeline (topic-specific question selection, per-topic corpus construction, nugget generation with GPT-4o, judgment pool creation via ensemble retrieval, and nugget-level support assessment) produces challenging, contamination-resistant datasets across five niche domains. Across retrieval and RAG experiments, out-of-the-box models fall well short of oracle settings, ensemble methods offer the strongest gains, and context quality proves critical for RAG accuracy. The work highlights significant headroom for improving retrieval in niche domains and provides a practical testbed for future work on IR and RAG evaluation.

Abstract

We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five topics) and oracle context helps an LLM generator generate a high-quality RAG answer. We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.
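The abstract mentions pooling documents through "a fusion of retrieval techniques" but does not name the fusion method. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common choice that is swapped in here as an assumption and is not confirmed to be the authors' method. The function name, the constant `k=60`, and the document ids are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of document ids into one pooled ranking.

    Each document scores 1 / (k + rank) for every list it appears in;
    the scores are summed across lists (standard RRF formulation).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Hypothetical outputs of two retrievers for one question (assumed data).
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d9", "d3"]

pool = reciprocal_rank_fusion([bm25_ranking, dense_ranking], top_n=4)
print(pool)
```

Documents that several retrievers rank highly rise to the top of the pool, which is the general motivation for building judgment pools from more than one retrieval system.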

Paper Structure

This paper contains 29 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: A data instance from LangChain generated with FreshStack. The question and answer pair is sourced from Stack Overflow. The pair is provided to GPT-4o to generate nuggets, highlighting necessary facts in the answer. Next, code snippets and technical documents from multiple GitHub repositories (e.g., Jupyter Notebook) are chunked, processed, and pooled for each question. Finally, each pooled document chunk is judged with GPT-4o for binary relevance (either yes or no) at the nugget level, i.e., whether the document factually supports the information present in each nugget.
  • Figure 2: The FreshStack framework: (1) Stack Overflow questions and answers are sourced for recent and niche topics. (2) GitHub repository documents are collected and chunked to form the corpus (for each topic). (3) Nuggets or atomic facts within the question and answer are generated with GPT-4o. (4) Ensemble techniques and models retrieve documents, which construct our document judgment pools. (5) GPT-4o evaluates support for every document-nugget pair as a binary judgment.
  • Figure 3: Timeline versus frequency of how many FreshStack queries were asked on Stack Overflow in every quarter. All queries included in FreshStack were asked between January 2023 and June 2024, with the highest frequencies observed in 2024, showing the growing importance of all five topics.
  • Figure 4: Token distribution of Stack Overflow questions and answers for all topics in FreshStack. Unlike other benchmarks, FreshStack questions (highlighted in green) are much longer than their answers (highlighted in maroon).
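The captions for Figures 1 and 2 describe binary support judgments for every document-nugget pair. One possible way to turn such judgments into a per-question score is sketched below: the fraction of a question's nuggets supported by at least one of the top-k retrieved documents. This aggregation, the function name `nugget_coverage`, and the example data are assumptions made for illustration, not details taken from the text above.

```python
def nugget_coverage(ranked_doc_ids, support, num_nuggets, k=10):
    """Fraction of nuggets supported by at least one top-k document.

    ranked_doc_ids: document ids in retrieval order for one question.
    support: set of (doc_id, nugget_index) pairs judged "yes".
    num_nuggets: number of nuggets generated for the question.
    """
    if num_nuggets == 0:
        return 0.0
    top_k = ranked_doc_ids[:k]
    covered = {
        nugget
        for doc_id in top_k
        for nugget in range(num_nuggets)
        if (doc_id, nugget) in support
    }
    return len(covered) / num_nuggets


# Hypothetical binary judgments: (document id, nugget index) pairs marked "yes".
judgments = {("d1", 0), ("d3", 0), ("d9", 2)}
ranked = ["d1", "d3", "d9", "d7"]

print(nugget_coverage(ranked, judgments, num_nuggets=3, k=2))
print(nugget_coverage(ranked, judgments, num_nuggets=3, k=3))
```

In this toy example the top two documents both support the same nugget, so retrieving a second redundant document does not raise the score; only the third document, which supports a different nugget, does.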