Cluster-based Adaptive Retrieval: Dynamic Context Selection for RAG Applications

Yifan Xu, Vipul Gupta, Rohit Aggarwal, Varsha Mahadevan, Bhaskar Krishnamachari

TL;DR

This work tackles the challenge of fixed retrieval depth in Retrieval-Augmented Generation by introducing Cluster-based Adaptive Retrieval (CAR), which uses clustering on ordered query-document similarity distances to determine an adaptive retrieval cutoff. CAR operates in three phases (initial retrieval, cluster-based grouping, and a boundary-gap cutoff), with a silhouette-guided hyperparameter search and a position-aware score to avoid premature cutoffs. In production on Coinbase CDP and on the MultiHop-RAG benchmark, CAR achieves superior efficiency and accuracy, reducing token usage and latency while lowering hallucinations, and it improves user engagement in real-world deployments. The approach is robust across multiple clustering backbones and embedding spaces, offering a practical, scalable dynamic retrieval mechanism for diverse RAG deployments.

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by pulling in external material (documents, code, manuals) from vast and ever-growing corpora to answer user queries effectively. The effectiveness of RAG depends significantly on aligning the number of retrieved documents with query characteristics: narrowly focused queries typically require fewer, highly relevant documents, whereas broader or ambiguous queries benefit from retrieving more extensive supporting information. However, the common static top-k retrieval approach fails to adapt to this variability, resulting in either insufficient context from too few documents or redundant information from too many. Motivated by these challenges, we introduce Cluster-based Adaptive Retrieval (CAR), an algorithm that dynamically determines the optimal number of documents by analyzing the clustering patterns of ordered query-document similarity distances. CAR detects the transition point within the similarity distances, where tightly clustered, highly relevant documents give way to less pertinent candidates, and uses it to establish an adaptive cut-off that scales with query complexity. On Coinbase's CDP corpus and the public MultiHop-RAG benchmark, CAR consistently picks the optimal retrieval depth and achieves the highest TES score, outperforming every fixed top-k baseline. In downstream RAG evaluations, CAR cuts LLM token usage by 60%, trims end-to-end latency by 22%, and reduces hallucinations by 10% while fully preserving answer relevance. Since integrating CAR into Coinbase's virtual assistant, we've seen user engagement jump by 200%.
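
To make the idea of an adaptive cut-off concrete, the following is a minimal sketch, not the authors' implementation. It assumes the retriever returns one distance per candidate document (smaller means more similar) and swaps CAR's clustering step for the simplest possible stand-in: a two-group split at the largest gap between consecutive sorted distances. The silhouette-guided hyperparameter search and position-aware score are omitted, and the names `adaptive_cutoff`, `select_documents`, and `min_docs` are hypothetical.

```python
from typing import List, Sequence


def adaptive_cutoff(distances: Sequence[float], min_docs: int = 1) -> int:
    """Return how many of the closest documents to keep.

    Simplified stand-in for CAR's clustering: sort the distances and cut
    at the largest gap between neighbours (a 1-D, two-cluster split).
    """
    min_docs = max(1, min_docs)
    ordered = sorted(distances)
    if len(ordered) <= min_docs:
        return len(ordered)
    # gaps[i] is the jump between the i-th and (i+1)-th closest documents.
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    # Only consider cut points that keep at least min_docs documents.
    candidates = range(min_docs - 1, len(gaps))
    best = max(candidates, key=lambda i: gaps[i])
    return best + 1


def select_documents(
    doc_ids: Sequence[str], distances: Sequence[float], min_docs: int = 1
) -> List[str]:
    """Rank documents by distance and keep those before the cut-off."""
    ranked = sorted(zip(distances, doc_ids), key=lambda pair: pair[0])
    k = adaptive_cutoff(distances, min_docs)
    return [doc_id for _, doc_id in ranked[:k]]


if __name__ == "__main__":
    # A narrow query: two close matches, then a clear drop in similarity.
    narrow = [0.10, 0.12, 0.55, 0.58, 0.60, 0.61]
    # A broader query: five comparably relevant documents before the drop.
    broad = [0.20, 0.22, 0.23, 0.25, 0.27, 0.70, 0.72]
    print(adaptive_cutoff(narrow))  # 2
    print(adaptive_cutoff(broad))   # 5
```

With a fixed top-k of 5, the narrow query above would receive three weakly related documents, while a fixed top-k of 2 would drop three relevant documents from the broad one. The gap-based cut adapts to both, which is the behavior the abstract attributes to CAR.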

Paper Structure

This paper contains 12 sections, 7 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Examples illustrating varying query complexity in CDP documentation search: (top-left) simple query answered by one API doc; (bottom-left) moderate query needing multiple docs to clarify product relationships; (right) complex query requiring synthesis across diverse sources.
  • Figure 2: Visualization of CAR’s cutoff mechanism: it adaptively selects retrieval thresholds (e.g., 3, 7, 11) based on clustering patterns in the embedding space, preserving relevant documents and filtering out weaker ones. The clustering outcome—shaped by query complexity and document characteristics—guides the cutoff decision.
  • Figure 3: Week-over-week growth in CDP search queries after deploying the AI summary system, showing increasing user engagement (week of January 13-19 is taken as reference).
  • Figure 4: Overview of the CDP RAG system pipeline, from query input to response generation. It employs multiple LLM calls for intent detection, retrieval, guardrails, answer generation, and quality checks, ensuring accuracy and trust. The CAR algorithm enhances search result relevance.
  • Figure 5: CAR dynamically adjusts the number of retrieved references based on query complexity, unlike fixed retrieval methods.