Cluster-based Adaptive Retrieval: Dynamic Context Selection for RAG Applications

Yifan Xu; Vipul Gupta; Rohit Aggarwal; Varsha Mahadevan; Bhaskar Krishnamachari

TL;DR

This work tackles the problem of fixed retrieval depth in Retrieval-Augmented Generation by introducing Cluster-based Adaptive Retrieval (CAR), which clusters the ordered query-document similarity distances to determine an adaptive retrieval cutoff. CAR operates in three phases (initial retrieval, cluster-based grouping, and a boundary-gap cutoff), with a silhouette-guided hyperparameter search and a position-aware score to avoid premature cutoffs. In production on Coinbase CDP and on the MultiHop-RAG benchmark, CAR improves both efficiency and accuracy relative to fixed top-k baselines, reducing token usage and latency while lowering hallucinations, and it improves user engagement in real-world deployments. The approach is robust across multiple clustering backbones and embedding spaces, offering a practical, scalable dynamic retrieval mechanism for diverse RAG deployments.

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by pulling in external material (documents, code, manuals) from vast and ever-growing corpora to answer user queries effectively. The effectiveness of RAG depends significantly on matching the number of retrieved documents to the characteristics of the query: narrowly focused queries typically require fewer, highly relevant documents, whereas broader or ambiguous queries benefit from more extensive supporting information. However, the common static top-k retrieval approach fails to adapt to this variability, resulting in either insufficient context from too few documents or redundant information from too many. Motivated by these challenges, we introduce Cluster-based Adaptive Retrieval (CAR), an algorithm that dynamically determines the optimal number of documents by analyzing the clustering patterns of ordered query-document similarity distances. CAR detects the transition point within the similarity distances, where tightly clustered, highly relevant documents give way to less pertinent candidates, and uses it to establish an adaptive cut-off that scales with query complexity. On Coinbase's CDP corpus and the public MultiHop-RAG benchmark, CAR consistently picks the optimal retrieval depth and achieves the highest TES score, outperforming every fixed top-k baseline. In downstream RAG evaluations, CAR cuts LLM token usage by 60%, trims end-to-end latency by 22%, and reduces hallucinations by 10% while fully preserving answer relevance. Since integrating CAR into Coinbase's virtual assistant, we have seen user engagement increase by 200%.
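
To make the idea of a transition point in ordered distances concrete, the following is a minimal sketch, not the CAR algorithm itself. It assumes a simplified stand-in for the clustering step: the sorted distances are split into two groups at the largest gap between adjacent values (equivalent to single-linkage clustering into two clusters in one dimension). The function name `adaptive_cutoff` and the parameter `min_docs` are hypothetical, and the silhouette-guided search and position-aware score mentioned in the TL;DR are not modeled here.

```python
def adaptive_cutoff(distances, min_docs=1):
    """Return how many top-ranked documents to keep.

    distances: query-document distances (smaller means more similar).
    min_docs:  hypothetical lower bound on the number of documents kept.

    Simplified stand-in for cluster-based cutoff selection: cut the
    sorted distances at the largest gap between adjacent values.
    """
    ordered = sorted(distances)
    if len(ordered) <= min_docs:
        return len(ordered)

    # Gap between each adjacent pair in the ordered list.
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]

    # Only consider cut points that keep at least min_docs documents.
    start = max(min_docs - 1, 0)
    cut = max(range(start, len(gaps)), key=lambda i: gaps[i])

    # Keep everything up to and including the document before the gap.
    return cut + 1


# Three close matches followed by a clear jump in distance.
k = adaptive_cutoff([0.10, 0.12, 0.13, 0.45, 0.47, 0.50])
print(k)  # 3
```

In this toy input the first three distances sit close together and the fourth is much larger, so the cutoff falls after the third document. A fixed top-k would return the same number of documents regardless of where that jump occurs.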
