Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search

Dong Liu, Yanxuan Yu

TL;DR

This work tackles the mismatch between user query granularity and single-resolution vector databases in Retrieval-Augmented Generation. It introduces Semantic Pyramid Indexing (SPI), a distributed, multi-resolution embedding framework with a query-adaptive controller that enables progressive coarse-to-fine retrieval and preserves semantic consistency across levels. Empirically, SPI delivers up to 5.7× faster retrieval and up to 1.8× memory savings, while improving end-to-end QA F1 by up to 2.5 points across text and multimodal benchmarks; a three-level pyramid is identified as the sweet spot. The approach is compatible with existing VecDB backends (FAISS, Qdrant) and demonstrates near-linear scaling in distributed settings, which makes it a practical candidate for production RAG deployments. The work also provides theoretical guarantees on recall and semantic stability, ablations validating each component, and practical insights on deployment and robustness across domains and modalities.

Abstract

Retrieval-Augmented Generation (RAG) systems have become a dominant approach to augment large language models (LLMs) with external knowledge. However, existing vector database (VecDB) retrieval pipelines rely on flat or single-resolution indexing structures, which cannot adapt to the varying semantic granularity required by diverse user queries. This limitation leads to suboptimal trade-offs between retrieval speed and contextual relevance. To address this, we propose Semantic Pyramid Indexing (SPI), a novel multi-resolution vector indexing framework that introduces query-adaptive resolution control for RAG in VecDBs. Unlike existing hierarchical methods that require offline tuning or separate model training, SPI constructs a semantic pyramid over document embeddings and dynamically selects the optimal resolution level per query through a lightweight classifier. This adaptive approach enables progressive retrieval from coarse to fine representations, significantly accelerating search while maintaining semantic coverage. We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks. SPI achieves up to 5.7× retrieval speedup and 1.8× memory efficiency gain while improving end-to-end QA F1 scores by up to 2.5 points compared to strong baselines. Our theoretical analysis provides guarantees on retrieval quality and latency bounds, while extensive ablation studies validate the contribution of each component. The framework's compatibility with existing VecDB infrastructures makes it readily deployable in production RAG systems. Code is available at https://github.com/FastLM/SPI_VecDB.
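
To make the coarse-to-fine idea concrete, the following is a minimal NumPy sketch of progressive retrieval over a three-level pyramid. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the pyramid levels are built, so the sketch simply truncates each embedding to a prefix of its dimensions, and it replaces the paper's lightweight classifier with a hypothetical token-count heuristic. The names `build_pyramid`, `choose_depth`, and `progressive_search`, and the parameters `dims` and `shortlist`, are invented for this example.

```python
import numpy as np


def build_pyramid(embeddings, dims=(16, 64, 256)):
    """Return one L2-normalised matrix per level, ordered coarse to fine.

    Assumption: a level is the first `d` dimensions of each embedding.
    The paper's actual level construction is not given in the abstract.
    """
    levels = []
    for d in dims:
        level = embeddings[:, :d].astype(np.float32)
        norms = np.linalg.norm(level, axis=1, keepdims=True)
        levels.append(level / np.maximum(norms, 1e-12))
    return levels


def choose_depth(query_text, num_levels):
    """Hypothetical stand-in for the query-adaptive controller.

    Short queries stop at the coarse level; longer ones descend further.
    """
    n_tokens = len(query_text.split())
    if n_tokens <= 3:
        return 1
    if n_tokens <= 8:
        return min(2, num_levels)
    return num_levels


def progressive_search(query_vec, levels, depth, k=5, shortlist=50):
    """Search level 0 over all documents, then re-rank survivors at finer levels."""
    candidates = np.arange(levels[0].shape[0])
    for i in range(depth):
        level = levels[i]
        d = level.shape[1]
        q = query_vec[:d].astype(np.float32)
        q = q / max(float(np.linalg.norm(q)), 1e-12)
        scores = level[candidates] @ q  # cosine similarity on unit vectors
        # Keep a shrinking shortlist until the last visited level, then top-k.
        keep = k if i == depth - 1 else max(k, shortlist // (i + 1))
        keep = min(keep, len(candidates))
        order = np.argsort(-scores)[:keep]
        candidates = candidates[order]
    return candidates


# Usage on synthetic data: pick a depth from the query text, then search.
rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 256))
pyramid = build_pyramid(docs)
depth = choose_depth("which index level answers a broad topical question", len(pyramid))
top_ids = progressive_search(docs[42], pyramid, depth)
print(depth, top_ids)
```

The design point the sketch tries to capture is that only the coarsest level is scanned in full; finer, more expensive levels score only the surviving shortlist, and a shallow `depth` skips them entirely. In a real deployment each level would live in an approximate index (for example in FAISS or Qdrant) rather than a dense matrix.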

Paper Structure

This paper contains 22 sections, 9 equations, 5 figures, 14 tables, 1 algorithm.

Figures (5)

  • Figure 1: Semantic Pyramid Indexing (SPI) Design in VecDBs
  • Figure 2: Distributed Semantic Pyramid Indexing (SPI): multi-resolution semantic hierarchy with adaptive query depth and distributed parallel retrieval.
  • Figure 3: End-to-end SPI workflow: progressive encoding, distributed retrieval, adaptive control, and aggregation.
  • Figure 4: Illustration of distributed parallel retrieval and adaptive depth pruning in SPI.
  • Figure 5: Detailed analysis of SPI's performance characteristics. (a) Component ablation analysis showing the contribution of each system component. (b) Distribution analysis of query processing times, memory usage, and accuracy across different query types.