Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment

Mengzhao Wang, Weizhi Xu, Xiaomeng Yi, Songlin Wu, Zhangyang Peng, Xiangyu Ke, Yunjun Gao, Xiaoliang Xu, Rentong Guo, Charles Xie

TL;DR

Starling addresses high-dimensional vector similarity search (HVSS) on data segments with tight memory and disk budgets by introducing an I/O-efficient disk-resident graph index. It combines an in-memory navigation graph with a reordered on-disk graph and a block-oriented search strategy to minimize disk I/O while preserving accuracy. The work proves that optimal block shuffling is NP-hard and offers practical heuristics, reporting up to 43.9x higher throughput and 98% lower latency than state-of-the-art baselines on real datasets, as well as scalability to billion-scale data on a single machine. The approach applies to existing graph indices and offers a practical path to scalable HVSS in disk-based vector databases, with planned extensions to caching, GPUs, and Milvus integration.

Abstract

High-dimensional vector similarity search (HVSS) is gaining prominence as a powerful tool for various data science and AI applications. As vector data scales up, in-memory indexes pose a significant challenge due to the substantial increase in main memory requirements. A potential solution involves leveraging disk-based implementation, which stores and searches vector data on high-performance devices like NVMe SSDs. However, implementing HVSS for data segments proves to be intricate in vector databases where a single machine comprises multiple segments for system scalability. In this context, each segment operates with limited memory and disk space, necessitating a delicate balance between accuracy, efficiency, and space cost. Existing disk-based methods fall short as they do not holistically address all these requirements simultaneously. In this paper, we present Starling, an I/O-efficient disk-resident graph index framework that optimizes data layout and search strategy within the segment. It has two primary components: (1) a data layout incorporating an in-memory navigation graph and a reordered disk-based graph with enhanced locality, reducing the search path length and minimizing disk bandwidth wastage; and (2) a block search strategy designed to minimize costly disk I/O operations during vector query execution. Through extensive experiments, we validate the effectiveness, efficiency, and scalability of Starling. On a data segment with 2GB memory and 10GB disk capacity, Starling can accommodate up to 33 million vectors in 128 dimensions, offering HVSS with over 0.9 average precision and top-10 recall rate, and latency under 1 millisecond. The results showcase Starling's superior performance, exhibiting 43.9$\times$ higher throughput with 98% lower query latency compared to state-of-the-art methods while maintaining the same level of accuracy.
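The block search idea in component (2) can be illustrated with a toy simulation. The sketch below is not Starling's implementation: the function `block_search`, its parameters, and the in-memory "disk" are all hypothetical, and it simplifies by expanding every vertex in a block as soon as that block is read. It only shows why examining all co-resident vertices of a fetched block can reduce the number of reads when the layout has good locality.

```python
import heapq
import numpy as np

def block_search(query, vectors, neighbors, blocks, entry, k=1, beam=4):
    """Toy best-first graph search that counts simulated block reads.

    Assumption (illustrative only): one "disk read" returns every vertex
    stored in the same block, and all of them are expanded immediately.
    """
    block_of = {v: b for b, members in enumerate(blocks) for v in members}
    dist = lambda v: float(np.sum((vectors[v] - query) ** 2))

    seen = {entry}
    frontier = [(dist(entry), entry)]   # min-heap of discovered vertices
    best = [(dist(entry), entry)]       # current top-`beam` results, sorted
    loaded = set()                      # blocks already read from "disk"

    while frontier:
        d, v = heapq.heappop(frontier)
        if len(best) >= beam and d > best[-1][0]:
            break                       # nothing closer is left to expand
        b = block_of[v]
        if b in loaded:
            continue                    # v was expanded when its block was read
        loaded.add(b)                   # one simulated disk read
        for u in blocks[b]:             # co-resident vertices come for free
            for w in (u, *neighbors[u]):
                if w not in seen:
                    seen.add(w)
                    item = (dist(w), w)
                    heapq.heappush(frontier, item)
                    best.append(item)
        best = sorted(best)[:beam]
    return [v for _, v in best[:k]], len(loaded)

# Toy data: 12 points on a line, each linked to its left and right neighbor.
vectors = np.arange(12, dtype=float).reshape(-1, 1)
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < 12] for i in range(12)]
query = np.array([9.2])

packed = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]  # neighbors share blocks
single = [[i] for i in range(12)]                       # one vertex per block

ids_packed, reads_packed = block_search(query, vectors, neighbors, packed, entry=0)
ids_single, reads_single = block_search(query, vectors, neighbors, single, entry=0)
print(ids_packed, reads_packed, ids_single, reads_single)
```

On this toy input both layouts return the same nearest vertex, but the layout that stores graph neighbors in the same block needs fewer simulated reads. A real system would also have to decide which co-resident vertices are worth expanding; that choice is omitted here.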

Paper Structure (48 sections, 2 theorems, 11 equations, 26 figures, 22 tables, 3 algorithms)

Key Result

Theorem 1

The block shuffling problem is NP-hard and does not have a polynomial time approximation algorithm with a finite approximation factor unless P=NP.
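Because the exact problem is intractable, a layout is chosen heuristically in practice. The sketch below is a minimal illustration under stated assumptions, not the heuristics from the paper: `greedy_pack` and `locality_score` are hypothetical names, and the score (the average fraction of a vertex's block co-residents that are also its graph neighbors) is an assumed locality measure, since this summary does not reproduce the paper's definition.

```python
def greedy_pack(neighbors, per_block):
    """Hypothetical greedy layout: seed a block with the next unassigned
    vertex, then add its unassigned neighbors until the block is full.
    Blocks may end up partially filled; a real layout would handle that."""
    assigned = set()
    blocks = []
    for v in range(len(neighbors)):
        if v in assigned:
            continue
        block = [v]
        assigned.add(v)
        for w in neighbors[v]:
            if len(block) == per_block:
                break
            if w not in assigned:
                block.append(w)
                assigned.add(w)
        blocks.append(block)
    return blocks

def locality_score(neighbors, blocks):
    """Assumed measure: for each vertex, the share of the other vertices in
    its block that are its graph neighbors, averaged over all vertices."""
    block_of = {v: b for b, members in enumerate(blocks) for v in members}
    total = 0.0
    for members in blocks:
        if len(members) < 2:
            continue  # a lone vertex has no co-residents to score
        for u in members:
            hits = sum(1 for w in neighbors[u] if w != u and block_of[w] == block_of[u])
            total += hits / (len(members) - 1)
    return total / len(neighbors)

# Toy graph: vertex i is linked to i - 2 and i + 2, so even and odd
# vertices form two separate chains and ID order separates neighbors.
n = 8
neighbors = [[j for j in (i - 2, i + 2) if 0 <= j < n] for i in range(n)]

id_order = [[0, 1], [2, 3], [4, 5], [6, 7]]
shuffled = greedy_pack(neighbors, per_block=2)

score_id_order = locality_score(neighbors, id_order)
score_shuffled = locality_score(neighbors, shuffled)
print(shuffled, score_id_order, score_shuffled)
```

In this toy graph the ID-order layout never places a vertex next to one of its neighbors, while the greedy layout always does, which is the kind of gap a reordering step aims to close.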

Figures (26)

  • Figure 1: Two indexing strategies on a single machine for a vector database system. In a distributed setting, different machines share the same strategy [Manu_zilliz].
  • Figure 2: Illustration of the data layouts and search strategies for the baseline and Starling, respectively.
  • Figure 3: Block shuffling. Refer to Fig. \ref{fig: overview_graph_layout} for graph topology.
  • Figure 4: RS performance (AP vs Latency).
  • Figure 5: RS performance (QPS vs AP).
  • ...and 21 more figures

Theorems & Definitions (10)

  • Example 1
  • Definition 1
  • Example 2
  • Example 3
  • Definition 2
  • Theorem 1
  • Example 4
  • Example 5
  • Example 6
  • Lemma 1