DiskJoin: Large-scale Vector Similarity Join with SSD
Yanqi Chen, Xiao Yan, Alexandra Meliou, Eric Lo
TL;DR
DiskJoin tackles large-scale vector similarity self-join on SSDs by organizing data into buckets, building a bucket graph of potential neighbor relationships, and orchestrating bucket-wise execution to maximize cache reuse. It combines three core innovations: (i) efficient bucketization with a near-sequential disk layout and a lightweight HNSW-based center index, (ii) probabilistic pruning that sharply reduces the number of candidate bucket pairs while preserving recall, and (iii) task orchestration that uses Belady's cache eviction and graph reordering to minimize disk loads. Empirical results on billion-scale datasets show DiskJoin achieving speedups of 50x to 1000x over strong disk-based and distributed baselines, with nearly zero read amplification and with disk I/O no longer the bottleneck. The method generalizes to cross-joins and can accommodate attribute filtering, offering a practical, single-machine solution for industrial-scale similarity joins on SSDs.
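To illustrate the eviction rule named in (iii), the following is a minimal sketch of Belady's policy applied to a sequence of bucket accesses. It is not the authors' implementation. It assumes the full access order is known in advance, which is what makes Belady's rule applicable. The function name `belady_misses` and the toy access sequence are hypothetical.

```python
def belady_misses(accesses, capacity):
    """Count cache misses when evicting the bucket whose next use is farthest away."""
    # For every position, record where the same bucket is accessed next (inf if never).
    next_use = [float("inf")] * len(accesses)
    last_seen = {}
    for i in range(len(accesses) - 1, -1, -1):
        next_use[i] = last_seen.get(accesses[i], float("inf"))
        last_seen[accesses[i]] = i

    cache = {}  # bucket id -> position of its next access
    misses = 0
    for i, bucket in enumerate(accesses):
        if bucket in cache:
            cache[bucket] = next_use[i]  # hit: refresh the next-use position
            continue
        misses += 1  # miss: the bucket would have to be loaded from disk
        if len(cache) >= capacity:
            # Belady's rule: evict the cached bucket needed farthest in the future.
            victim = max(cache, key=cache.get)
            del cache[victim]
        cache[bucket] = next_use[i]
    return misses


# Toy access order over four buckets (illustrative only).
order = [0, 1, 2, 0, 1, 3, 0, 1]
print(belady_misses(order, capacity=3))
print(belady_misses(order, capacity=2))
```

With room for three buckets, each of the four distinct buckets in the toy order is loaded exactly once. Shrinking the cache to two buckets forces reloads. Reordering the accesses so that reuses sit closer together is the complementary lever that the graph reordering step targets.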
Abstract
Similarity join, a widely used operation in data science, finds all pairs of items whose distance is smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes, but these methods require a cluster deployment, and their efficiency suffers from expensive inter-machine communication. Disk-based solutions, on the other hand, are more cost-effective: they use a single machine and store the large dataset on high-performance external storage, such as NVMe SSDs. In these methods, however, disk I/O time is a serious bottleneck. In this paper, we propose DiskJoin, the first disk-based similarity join algorithm that can process billion-scale vector datasets efficiently on a single machine. DiskJoin improves disk I/O by tailoring the data access patterns to avoid repetitive accesses and read amplification. It also uses main memory as a dynamic cache and carefully manages cache eviction to improve the cache hit rate and reduce disk retrieval time. For further acceleration, we adopt a probabilistic pruning technique that effectively prunes a large number of vector pairs from computation. Our evaluation on real-world, large-scale datasets shows that DiskJoin significantly outperforms alternatives, achieving speedups from 50x to 1000x.
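For reference, the operation defined in the first sentence of the abstract can be written as a brute-force baseline. This is a minimal sketch of the problem definition only, not of DiskJoin. It assumes Euclidean distance, whereas the abstract only says "distance", and the function name `similarity_self_join` is hypothetical.

```python
import math


def similarity_self_join(vectors, threshold):
    """Return all index pairs (i, j), i < j, whose distance is below the threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            # Euclidean distance is an assumption made for this sketch.
            if math.dist(vectors[i], vectors[j]) < threshold:
                pairs.append((i, j))
    return pairs


# Two tight clusters that are far apart from each other (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 5.5)]
print(similarity_self_join(points, threshold=1.5))
```

The nested loop compares every pair, so its cost grows quadratically with the number of vectors. That cost is why bucketing, pruning, and careful I/O scheduling matter at billion scale.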
