Progressive Searching for Retrieval in RAG
Taehee Jeong, Xingzhe Zhao, Peizu Li, Markus Valvur, Weihua Zhao
TL;DR
Retrieval-Augmented Generation (RAG) aims to mitigate outdated information and hallucinations by integrating external documents into LLM prompts. The paper introduces Progressive Retrieval, a multi-stage KNN search that starts from low-dimensional embeddings and progressively expands to the target dimensionality, reducing query time while preserving high accuracy. Experiments with OpenAI's text-embedding-3-large and Alibaba-NLP's gte-Qwen2-7B-instruct demonstrate substantial speedups (up to ~5x) at high dimensions while maintaining comparable Top-1 accuracy, enabling scalable real-time RAG pipelines. These findings guide practical design choices for embedding dimensionality and candidate filtering in large-scale RAG systems.
Abstract
Retrieval-Augmented Generation (RAG) is a promising technique for mitigating two key limitations of large language models (LLMs): outdated information and hallucinations. A RAG system stores documents as embedding vectors in a database. Given a query, a search is executed to find the most closely related documents, and the top-matching documents are inserted into the LLM's prompt to generate a response. Efficient and accurate search is therefore critical for RAG to retrieve relevant information. We propose a cost-effective search algorithm for the retrieval process. Our progressive search algorithm incrementally refines the candidate set through a hierarchy of searches, starting from low-dimensional embeddings and progressing to the higher target dimensionality. This multi-stage approach reduces retrieval time while preserving the desired accuracy. Our findings demonstrate that progressive search in RAG systems balances dimensionality, speed, and accuracy, enabling scalable, high-performance retrieval even for large databases.
