Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$

Chihiro Taguchi, Seiji Maekawa, Nikita Bhutani

TL;DR

Adaptive-$k$ introduces a single-pass, plug-and-play retrieval method for long-context QA that selects the number of passages by locating the largest gap in the distribution of query-to-passage similarities, enabling per-query context sizing without tuning. It achieves substantial token reductions (up to 99% in factoid QA and 2x–10x in aggregation QA) while maintaining or improving accuracy across multiple LCLMs and embedding models. The approach is validated on HELMET and HoloBench benchmarks, showing robust gains across diverse models and demonstrating that dynamic context sizing improves both efficiency and answer quality in open-domain QA. Overall, Adaptive-$k$ offers a practical, model-agnostic alternative to fixed retrieval budgets and iterative adaptive methods, suitable for API-based deployments and large-scale QA systems.

Abstract

Retrieval-augmented generation (RAG) and long-context language models (LCLMs) both address context limitations of LLMs in open-domain question answering (QA). However, determining how much external context to retrieve remains an open problem: fixing the retrieval size risks either wasting tokens or omitting key evidence. Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM prompting and perform well on factoid QA, but struggle with aggregation QA, where the optimal context size is both unknown and variable. We present Adaptive-$k$ retrieval, a simple and effective single-pass method that adaptively selects the number of passages based on the distribution of the similarity scores between the query and the candidate passages. It does not require model fine-tuning, extra LLM inferences, or changes to existing retriever-reader pipelines. On both factoid and aggregation QA benchmarks, Adaptive-$k$ matches or outperforms fixed-$k$ baselines while using up to 10x fewer tokens than full-context input, yet still retrieves 70% of relevant passages. It improves accuracy across five LCLMs and two embedding models, highlighting that dynamically adjusting context size leads to more efficient and accurate QA.
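The selection step described above can be illustrated with a minimal sketch: sort the query-to-passage cosine similarities in descending order and cut at the largest drop between adjacent scores. This is an illustration under stated assumptions, not the authors' reference implementation. The function name `select_by_largest_gap` and the `min_k` parameter are hypothetical, embeddings are assumed to be non-zero NumPy vectors, and the paper's actual algorithm may apply further rules that this page does not describe.

```python
import numpy as np


def select_by_largest_gap(query_vec, passage_vecs, min_k=1):
    """Return indices of passages ranked above the largest similarity gap.

    Hypothetical helper: `min_k` (an assumption, not from the paper) forces
    at least that many passages to be kept.
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    sims = p @ q

    # Rank passages from most to least similar.
    order = np.argsort(-sims)
    sorted_sims = sims[order]
    if len(sorted_sims) <= min_k:
        return order  # too few candidates to look for a gap

    # gaps[i] is the drop in score between rank i and rank i + 1.
    gaps = sorted_sims[:-1] - sorted_sims[1:]

    # Cutting after rank i keeps i + 1 passages; skip cuts that keep < min_k.
    k = int(np.argmax(gaps[min_k - 1:])) + min_k
    return order[:k]


# Toy example: two passages point almost the same way as the query,
# two are nearly orthogonal to it.
query = np.array([1.0, 0.0])
passages = np.array([[0.0, 1.0], [1.0, 0.0], [0.1, 1.0], [0.9, 0.1]])
selected = select_by_largest_gap(query, passages)
print(selected.tolist())  # [1, 3]
```

Because the cut point depends only on the score distribution for the current query, the number of selected passages varies per query with no extra LLM call, which is the property the abstract emphasizes.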

Paper Structure

This paper contains 35 sections, 1 equation, 5 figures, 18 tables, 1 algorithm.

Figures (5)

  • Figure 1: Example distributions of sorted cosine similarities from the long-context version of HotpotQA (Yang et al., 2018) included in HELMET (Yen et al., 2025) with 1,000 context documents (top) and HoloBench (Maekawa et al., 2025) with 10% relevant information amount (bottom). BAAI's bge-large-en-v1.5 is used as the embedding model.
  • Figure 2: The proposed method in the RAG workflow. The method chooses the threshold $k$ for retrieval based on a large gap in the sorted similarity score distribution.
  • Figure 3: The results with different amounts of relevant information in the HoloBench tasks. The best-performing fixed-n setting is chosen as the oracle. One marker denotes performance improvement and the other the number of input tokens.
  • Figure 4: A performance comparison of our proposed method (Adaptive-$k$) in the factoid QA tasks against existing methods. The embedding model is bge-large-en-v1.5, and the reader model is GPT-4o. One marker denotes the SubEM scores and the other the number of input tokens.
  • Figure 5: A performance comparison across the different reader models in the HoloBench task. The embedding model is bge-large-en-v1.5. One marker denotes performance improvement and the other the number of input tokens.