Fantasy: Efficient Large-scale Vector Search on GPU Clusters with GPUDirect Async
Yi Liu, Chen Qian
TL;DR
Fantasy addresses the bottleneck of large-scale vector search on GPU clusters by partitioning both the index graph and the vectors across GPUs and coupling K-means-based routing with GPUDirect Async RDMA. The system executes a four-stage workflow entirely in HBM and uses a two-microbatch pipeline to overlap computation with communication, achieving high per-GPU throughput for large batch sizes. Analytic modeling of the stage latencies (sub-millisecond local K-means, a few milliseconds for dispatch, tens of milliseconds for parallel search, and about 10 ms for result consolidation) indicates that the approach can sustain billion-scale vector search on GPU clusters. This has practical impact for retrieval-augmented generation, semantic search, and other AI workloads that require fast, high-recall nearest-neighbor retrieval at scale without incurring CPU or SSD bottlenecks.
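The effect of overlapping computation and communication can be illustrated with a small back-of-the-envelope model. The sketch below is not the authors' analytic model: it assumes that K-means and search occupy the GPU compute units, that dispatch and consolidation occupy the network, and that with two microbatches in flight the two resources overlap perfectly in steady state. The class and function names are hypothetical, and the latency values are illustrative picks from within the ranges quoted above.

```python
from dataclasses import dataclass


@dataclass
class StageLatenciesMs:
    """Per-microbatch stage latencies in milliseconds (illustrative values)."""
    kmeans: float       # stage 1: local K-means routing (assumed compute-bound)
    dispatch: float     # stage 2: query dispatch over RDMA (assumed network-bound)
    search: float       # stage 3: parallel graph search (assumed compute-bound)
    consolidate: float  # stage 4: result consolidation (assumed network-bound)


def serial_interval_ms(s: StageLatenciesMs) -> float:
    """Time per microbatch when the four stages run back to back with no overlap."""
    return s.kmeans + s.dispatch + s.search + s.consolidate


def overlapped_interval_ms(s: StageLatenciesMs) -> float:
    """Idealized steady-state time per microbatch with two microbatches in flight.

    Assumption: while one microbatch uses the GPU, the other uses the network,
    so the busier of the two resources sets the pace.
    """
    compute = s.kmeans + s.search
    communication = s.dispatch + s.consolidate
    return max(compute, communication)


def throughput_qps(queries_per_microbatch: int, interval_ms: float) -> float:
    """Queries per second given the time between microbatch completions."""
    return queries_per_microbatch / (interval_ms / 1000.0)


# Hypothetical numbers chosen inside the ranges named in the text.
stages = StageLatenciesMs(kmeans=0.5, dispatch=3.0, search=30.0, consolidate=10.0)
serial = serial_interval_ms(stages)
overlapped = overlapped_interval_ms(stages)
print(f"serial interval:     {serial:.1f} ms")
print(f"overlapped interval: {overlapped:.1f} ms")
print(f"idealized speedup:   {serial / overlapped:.2f}x")
```

Under these assumptions the search stage dominates, so the pipeline hides the communication cost but cannot go faster than the compute path; a real schedule would also pay fill and drain overheads that this sketch ignores.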
Abstract
Vector similarity search has become a critical component in AI-driven applications such as large language models (LLMs). To achieve high recall and low latency, GPUs are used to exploit massive parallelism for faster query processing. However, as the number of vectors continues to grow, the index graph quickly exceeds the memory capacity of a single GPU, making it infeasible to store and process the entire index on one device. Recent work uses CPU-GPU architectures that keep vectors in CPU memory or on SSDs, but the loading step stalls GPU computation. We present Fantasy, an efficient system that pipelines vector search and data transfer in a GPU cluster with GPUDirect Async. Fantasy overlaps computation and network communication to significantly improve search throughput for large graphs and to support large query batch sizes.
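To make the single-GPU capacity limit concrete, the following sketch estimates the footprint of a graph index. Every number in it is an assumption chosen for illustration and does not come from the paper: one billion 128-dimensional float32 vectors, a fixed-degree graph with 32 neighbors per node stored as 4-byte IDs, an 80 GiB GPU, and 80% of HBM usable for the index. The function names are hypothetical.

```python
import math


def index_footprint_gib(num_vectors: int, dim: int, bytes_per_component: int,
                        graph_degree: int, bytes_per_neighbor_id: int = 4) -> float:
    """Approximate size of raw vectors plus a fixed-degree neighbor list, in GiB."""
    vector_bytes = num_vectors * dim * bytes_per_component
    graph_bytes = num_vectors * graph_degree * bytes_per_neighbor_id
    return (vector_bytes + graph_bytes) / 2**30


def min_gpus_needed(footprint_gib: float, hbm_gib_per_gpu: float,
                    usable_fraction: float = 0.8) -> int:
    """Smallest GPU count whose usable HBM holds the whole index (assumed even split)."""
    return math.ceil(footprint_gib / (hbm_gib_per_gpu * usable_fraction))


# Illustrative assumptions only; none of these values are taken from the paper.
footprint = index_footprint_gib(num_vectors=10**9, dim=128,
                                bytes_per_component=4, graph_degree=32)
gpus = min_gpus_needed(footprint, hbm_gib_per_gpu=80)
print(f"index footprint: {footprint:.0f} GiB")
print(f"GPUs needed:     {gpus}")
```

Under these assumptions the index is several times larger than one GPU's memory, which is the situation that motivates partitioning the graph and vectors across a cluster rather than staging them from CPU memory or SSD.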
