FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search

Bing Tian; Haikun Liu; Yuhang Tang; Shihai Xiao; Zhuohui Duan; Xiaofei Liao; Xuecang Zhang; Junhua Zhu; Yu Zhang

TL;DR

FusionANNS is a high-throughput, low-latency, cost-efficient, and high-accuracy ANNS system for billion-scale datasets that uses SSDs and only one entry-level GPU. It proposes three novel designs: multi-tiered indexing to avoid data swapping between CPUs and GPU, heuristic re-ranking to eliminate unnecessary I/Os and computations while guaranteeing high accuracy, and redundant-aware I/O deduplication to further improve I/O efficiency.

Abstract

Approximate nearest neighbor search (ANNS) has emerged as a crucial component of database and AI infrastructure. Ever-growing vector datasets pose significant challenges for ANNS services in terms of performance, cost, and accuracy. No modern ANNS system addresses all of these issues simultaneously. We present FusionANNS, a high-throughput, low-latency, cost-efficient, and high-accuracy ANNS system for billion-scale datasets that uses SSDs and only one entry-level GPU. The key idea of FusionANNS lies in CPU/GPU collaborative filtering and re-ranking mechanisms, which significantly reduce I/O operations across CPUs, GPU, and SSDs to break through the I/O performance bottleneck. Specifically, we propose three novel designs: (1) multi-tiered indexing to avoid data swapping between CPUs and GPU, (2) heuristic re-ranking to eliminate unnecessary I/Os and computations while guaranteeing high accuracy, and (3) redundant-aware I/O deduplication to further improve I/O efficiency. We implement FusionANNS and compare it with the state-of-the-art SSD-based ANNS system, SPANN, and the GPU-accelerated in-memory ANNS system, RUMMY. Experimental results show that FusionANNS achieves (1) 9.4-13.1X higher queries per second (QPS) and 5.7-8.8X higher cost efficiency than SPANN, and (2) 2-4.9X higher QPS and 2.3-6.8X higher cost efficiency than RUMMY, while guaranteeing low latency and high accuracy.
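The filter-then-re-rank idea named in the abstract can be illustrated with a small, generic sketch. This is not the FusionANNS implementation: the scalar rounding in `quantize` is only a stand-in for a compressed code (it is not product quantization), the page grouping merely simulates reading each storage page once, and all names (`quantize`, `search`, `page_size`, `n_candidates`) are hypothetical.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(v, step=0.5):
    """Assumed stand-in for a compressed code: round each coordinate to a grid."""
    return [round(x / step) * step for x in v]


def search(query, vectors, codes, k, n_candidates, page_size=2):
    """Return (top-k ids, number of simulated page reads)."""
    # Stage 1 (filter): rank ids by approximate distance on the compressed codes.
    candidates = heapq.nsmallest(
        n_candidates, range(len(codes)), key=lambda i: sq_dist(query, codes[i])
    )

    # Group candidates by the page that holds their full-precision vector,
    # so that each page is read once even if several candidates share it.
    pages = {}
    for i in candidates:
        pages.setdefault(i // page_size, []).append(i)

    # Stage 2 (re-rank): compute exact distances on full-precision vectors.
    page_reads = 0
    exact = []
    for ids in pages.values():
        page_reads += 1  # one simulated read per distinct page
        for i in ids:
            exact.append((sq_dist(query, vectors[i]), i))

    top = [i for _, i in heapq.nsmallest(k, exact)]
    return top, page_reads


vectors = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.9]]
codes = [quantize(v) for v in vectors]
top_ids, reads = search([1.0, 1.0], vectors, codes, k=2, n_candidates=4)
```

In this toy setup, four candidates fall on two pages, so only two simulated reads are needed before the exact re-ranking picks the final top-k.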

Paper Structure (18 sections, 3 equations, 12 figures, 3 tables, 1 algorithm)

Introduction
Background and Motivation
Indexing Techniques for ANNS
Product Quantization
Main Idea and Challenges
FusionANNS Overview
FusionANNS Design
Multi-tiered Indexing
Heuristic Re-ranking
Redundant-aware I/O Deduplication
Implementation
Evaluation
Performance
Scalability
Effectiveness of Individual Techniques
...and 3 more sections

Figures (12)

Figure 1: The framework of retrieval augmented generation
Figure 2: The hierarchical indexing technique in SPANN
Figure 3: The throughput and latency of SPANN
Figure 4: Three combinations of hierarchical indexing (HI), product quantization (PQ), and GPU acceleration
Figure 5: Differential characterization between queries
...and 7 more figures
