LSM-VEC: A Large-Scale Disk-Based System for Dynamic Vector Search

Shurui Zhong, Dingheng Mo, Siqiang Luo

TL;DR

LSM-VEC targets scalable, dynamic, disk-based approximate nearest neighbor search over billion-scale vector embeddings, addressing both the memory constraints of in-memory indices and the need for real-time updates. It integrates a graph-based proximity index with an LSM-tree storage backend, decoupling vector data from the graph and enabling out-of-place updates. Two key innovations, sampling-guided probabilistic traversal and connectivity-aware graph reordering, reduce disk I/O while preserving recall, aided by a write-friendly bottom layer and selective neighbor exploration. Empirical results on SIFT1B show higher recall, lower query and update latency, and a substantially lower memory footprint than DiskANN and SPFresh, supporting LSM-VEC as a practical solution for real-world large-scale dynamic vector search.

Abstract

Vector search underpins modern AI applications by supporting approximate nearest neighbor (ANN) queries over high-dimensional embeddings in tasks like retrieval-augmented generation (RAG), recommendation systems, and multimodal search. Traditional ANN search indices (e.g., HNSW) are limited by memory constraints at large scale. Disk-based indices such as DiskANN reduce memory overhead but rely on offline graph construction, resulting in costly and inefficient vector updates. The state-of-the-art clustering-based approach SPFresh offers better scalability but suffers from reduced recall due to coarse partitioning. Moreover, SPFresh employs in-place updates to maintain its index structure, limiting its efficiency in handling high-throughput insertions and deletions under dynamic workloads. This paper presents LSM-VEC, a disk-based dynamic vector index that integrates hierarchical graph indexing with LSM-tree storage. By distributing the proximity graph across multiple LSM-tree levels, LSM-VEC supports out-of-place vector updates. It enhances search efficiency via a sampling-based probabilistic search strategy with adaptive neighbor selection, and connectivity-aware graph reordering further reduces I/O without requiring global reconstruction. Experiments on billion-scale datasets demonstrate that LSM-VEC consistently outperforms existing disk-based ANN systems. It achieves higher recall and lower query and update latency while reducing memory footprint by over 66.2%, making it well-suited for real-world large-scale vector search with dynamic updates.
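
To make the idea of out-of-place graph updates concrete, here is a minimal Python sketch of a toy LSM-style edge store. It is not the LSM-VEC implementation: the class name `ToyLSMGraph`, the record layout `(node, op, neighbor)`, and the `memtable_limit` parameter are all hypothetical, and a real LSM-tree has leveled runs, on-disk files, and background compaction. The sketch only shows the principle that edge insertions and deletions are appended as new records instead of rewriting an adjacency list in place.

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[int, str, int]  # (node, op, neighbor); op is "add" or "del"


class ToyLSMGraph:
    """Hypothetical toy model of graph edges kept in an LSM-style log."""

    def __init__(self, memtable_limit: int = 4) -> None:
        self.memtable: List[Record] = []    # in-memory write buffer
        self.runs: List[List[Record]] = []  # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def _append(self, node: int, op: str, neighbor: int) -> None:
        # Out-of-place write: never modify an existing record, only append.
        self.memtable.append((node, op, neighbor))
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def add_edge(self, u: int, v: int) -> None:
        self._append(u, "add", v)
        self._append(v, "add", u)

    def remove_edge(self, u: int, v: int) -> None:
        # A deletion is a tombstone-like record, resolved at read time.
        self._append(u, "del", v)
        self._append(v, "del", u)

    def flush(self) -> None:
        if self.memtable:
            # Stable sort by node keeps per-node chronological order.
            self.runs.append(sorted(self.memtable, key=lambda r: r[0]))
            self.memtable = []

    def neighbors(self, node: int) -> Set[int]:
        # Replay records from oldest run to newest, then the memtable.
        result: Set[int] = set()
        for source in self.runs + [self.memtable]:
            for n, op, nb in source:
                if n != node:
                    continue
                if op == "add":
                    result.add(nb)
                else:
                    result.discard(nb)
        return result

    def compact(self) -> None:
        # Merge all runs into one, dropping deleted and superseded records.
        self.flush()
        live: Dict[int, Set[int]] = {}
        for run in self.runs:
            for n, op, nb in run:
                edges = live.setdefault(n, set())
                if op == "add":
                    edges.add(nb)
                else:
                    edges.discard(nb)
        merged = [(n, "add", nb) for n in sorted(live) for nb in sorted(live[n])]
        self.runs = [merged] if merged else []
```

In this toy model a read has to consult every run, which is why compaction matters: it merges the runs and discards records that a later deletion has cancelled.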

Paper Structure

This paper contains 16 sections, 12 equations, 8 figures, and 2 algorithms.

Figures (8)

  • Figure 1: An example pipeline for approximate nearest neighbor (ANN) search, consisting of index construction, candidate selection, and distance computation.
  • Figure 2: LSM-VEC architecture.
  • Figure 3: An illustration of vector insertion in LSM-VEC. The new node $v_n$ is connected to two bottom-layer neighbors $v_4$ and $v_5$, and the resulting edges are stored in the LSM-tree.
  • Figure 4: An example of graph ordering to improve I/O efficiency.
  • Figure 5: Evaluation of LSM-VEC under four update scenarios with different insert-delete ratios. We report recall, update latency, and search latency, simulating real-world dynamic workloads where the index continuously evolves. Each batch corresponds to 1% vector updates (1% insertion or 1% deletion).
  • ...and 3 more figures
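
The intuition behind graph ordering for I/O efficiency (Figure 4) can be illustrated with a small Python sketch. It uses a plain breadth-first relabeling as a stand-in, which is an assumption made for illustration and not the paper's connectivity-aware reordering algorithm. The function names `bfs_layout` and `blocks_touched`, the example graph, and the block size of 4 are likewise hypothetical. The sketch counts how many distinct disk blocks must be read to fetch each node's neighbors before and after the relabeling.

```python
from collections import deque
from typing import Dict, List


def bfs_layout(adj: Dict[int, List[int]]) -> Dict[int, int]:
    """Assign each node a storage position in breadth-first visiting order."""
    position: Dict[int, int] = {}
    for start in sorted(adj):  # also covers disconnected components
        if start in position:
            continue
        position[start] = len(position)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in sorted(adj[u]):
                if v not in position:
                    position[v] = len(position)
                    queue.append(v)
    return position


def blocks_touched(adj: Dict[int, List[int]],
                   position: Dict[int, int],
                   block_size: int) -> int:
    """Sum, over all nodes, the number of distinct blocks holding its neighbors."""
    total = 0
    for u, nbrs in adj.items():
        total += len({position[v] // block_size for v in nbrs})
    return total


# Two 4-node rings whose node IDs are interleaved (even IDs vs. odd IDs).
adj = {
    0: [2, 6], 2: [0, 4], 4: [2, 6], 6: [4, 0],
    1: [3, 7], 3: [1, 5], 5: [3, 7], 7: [5, 1],
}

identity = {n: n for n in adj}  # store nodes in ID order
reordered = bfs_layout(adj)     # store connected nodes next to each other

before = blocks_touched(adj, identity, block_size=4)
after = blocks_touched(adj, reordered, block_size=4)
```

With nodes stored in ID order, every node's two neighbors fall into different blocks. After the relabeling, each ring occupies a single block, so fetching a node's neighbors reads one block instead of two.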