DiskJoin: Large-scale Vector Similarity Join with SSD

Yanqi Chen, Xiao Yan, Alexandra Meliou, Eric Lo

TL;DR

DiskJoin tackles large-scale vector similarity self-join on SSDs by organizing data into buckets, building a bucket graph of potential neighbor relationships, and orchestrating bucket-wise execution to maximize cache reuse. It combines three core innovations: (i) efficient bucketization with near-sequential disk layout and a lightweight HNSW-based center index, (ii) probabilistic pruning to sharply reduce candidate bucket pairs while preserving recall, and (iii) task orchestration using Belady's cache eviction and graph reordering to minimize disk loads. Empirical results on billion-scale datasets show DiskJoin achieving speedups of 2–3 orders of magnitude over strong disk-based and distributed baselines, with nearly zero read amplification and disk I/O no longer the bottleneck. The method generalizes to cross-joins and can accommodate attribute filtering, offering a practical, single-machine solution for industrial-scale similarity joins on SSDs.
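To make the interaction between task ordering and Belady's eviction concrete, here is a minimal sketch that counts bucket loads for a given order of bucket-pair tasks. It is an illustration under stated assumptions, not DiskJoin's implementation: the function names `next_use` and `count_loads_belady`, the rule that both buckets of the current task stay pinned in the cache, and the toy task lists are all hypothetical.

```python
from typing import List, Tuple

Task = Tuple[int, int]  # a pair of bucket ids that must be in memory together


def next_use(bucket: int, tasks: List[Task], start: int) -> float:
    """Index of the first task at or after `start` that touches `bucket`."""
    for j in range(start, len(tasks)):
        if bucket in tasks[j]:
            return j
    return float("inf")  # never used again


def count_loads_belady(tasks: List[Task], capacity: int) -> int:
    """Count bucket loads (cache misses) when tasks run in the given order.

    On a miss with a full cache, evict the cached bucket whose next use lies
    farthest in the future (Belady's rule). Assumption: buckets of the
    current task are pinned and never chosen as victims.
    """
    if capacity < 2:
        raise ValueError("cache must hold both buckets of a task")
    cache = set()
    loads = 0
    for i, pair in enumerate(tasks):
        for bucket in pair:
            if bucket in cache:
                continue  # cache hit, no disk load
            loads += 1
            if len(cache) >= capacity:
                evictable = sorted(cache - set(pair))  # sorted for determinism
                victim = max(evictable, key=lambda b: next_use(b, tasks, i + 1))
                cache.remove(victim)
            cache.add(bucket)
    return loads


# Same four bucket pairs, two processing orders, cache of two buckets.
original = [(1, 2), (3, 4), (1, 3), (2, 4)]
reordered = [(1, 2), (2, 4), (3, 4), (1, 3)]
print(count_loads_belady(original, capacity=2))   # 7
print(count_loads_belady(reordered, capacity=2))  # 5
```

In this toy case the reordered schedule keeps one bucket shared between consecutive tasks, so every task after the first needs only one load. This shows why the processing order matters even when eviction is already handled by Belady's rule.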

Abstract

Similarity join, a widely used operation in data science, finds all pairs of items whose distance is smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes, but these methods require a cluster deployment, and their efficiency suffers from expensive inter-machine communication. Disk-based solutions, on the other hand, are more cost-effective: they use a single machine and store the large dataset on high-performance external storage, such as NVMe SSDs. In these methods, however, disk I/O time is a serious bottleneck. In this paper, we propose DiskJoin, the first disk-based similarity join algorithm that can process billion-scale vector datasets efficiently on a single machine. DiskJoin improves disk I/O by tailoring the data access patterns to avoid repetitive accesses and read amplification. It also uses main memory as a dynamic cache and carefully manages cache eviction to improve cache hit rate and reduce disk retrieval time. For further acceleration, we adopt a probabilistic pruning technique that can effectively prune a large number of vector pairs from computation. Our evaluation on real-world, large-scale datasets shows that DiskJoin significantly outperforms alternatives, achieving speedups from 50x to 1000x.
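As a reference point for the operation itself, the sketch below spells out a similarity self-join by exhaustive comparison. It is a minimal illustration, not DiskJoin: the function name `brute_force_self_join` is hypothetical, and Euclidean distance is assumed as the metric. Its quadratic number of distance computations is what bucketing and pruning aim to avoid.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def brute_force_self_join(
    vectors: Sequence[Sequence[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Return all index pairs (i, j), i < j, whose distance is below threshold.

    Assumption: Euclidean distance. Every pair is compared, so the cost is
    quadratic in the number of vectors.
    """
    return [
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(vectors), 2)
        if math.dist(u, v) < threshold
    ]


points = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0), (3.0, 4.2)]
print(brute_force_self_join(points, threshold=1.0))  # [(0, 1), (2, 3)]
```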

Paper Structure

This paper contains 17 sections, 1 theorem, 5 equations, 16 figures, 3 algorithms.

Key Result

Theorem 1

The MECC problem is NP-hard.

Figures (16)

  • Figure 1: Profiling results for a baseline solution (using the state-of-the-art SSD-based vector index DiskANN to perform vector similarity join) and our DiskJoin on the BigANN100M dataset at 90% recall. Both methods use a memory size that is 10% of the dataset size.
  • Figure 2: Workflow of DiskJoin. Numbers (e.g., 1 and 2) indicate buckets, and letters (e.g., A and B) indicate vectors. (a) The inputs are the vector dataset and task configurations; (b) the vectors are grouped into buckets, and a bucket graph is constructed, where an edge means that the vectors in one bucket need to check another bucket for neighbors; (c) task orchestration decides a good processing order for the edges in the bucket graph to reduce cache misses, where the cache can hold 2 buckets.
  • Figure 5: Given the bucket graph (a), the original task ordering (b) results in 8 cache misses in a cache that holds 3 nodes (rectangle box). The reordered schedule (c) reduces the cache misses to 6. In the illustration of candidate bucket pruning (d), bucket $b_3$ is pruned, and as a result, the neighbors in the white arc are missed.
  • Figure 6: Summary statistics of the experiment datasets
  • Figure 7: Despite being a disk-based method, DiskJoin is orders of magnitude faster than Clusterjoin and RSHJ, with the latter failing to execute on larger data sizes. This is because DiskJoin performs orders-of-magnitude fewer distance computations.
  • ...and 11 more figures

Theorems & Definitions (3)

  • Definition 1: Similarity self-join (SSJ) for vector dataset
  • Definition 2: MECC: Minimum Edge Cover with Cache
  • Theorem 1