A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighborhood Search

Yiping Sun, Yang Shi, Jiaolong Du

TL;DR

This paper introduces RTAMS-GANNS, a novel Real-Time Adaptive Multi-Stream GPU ANNS System. It uses a dynamic resource pool so that multiple streams can execute concurrently without additional execution blocking, and a dynamic vector insertion algorithm based on memory blocks with in-place rearrangement.

Abstract

In recent years, Approximate Nearest Neighbor Search (ANNS) has played a pivotal role in modern search and recommendation systems, especially in emerging LLM applications such as Retrieval-Augmented Generation. There is growing interest in harnessing the parallel computing capabilities of GPUs to meet the substantial demands of ANNS. However, existing systems focus primarily on offline scenarios and overlook the distinct requirements of online applications, which need real-time insertion of new vectors. This limitation makes such systems inefficient in real-world scenarios. Moreover, previous architectures struggle to support real-time insertion because they rely on serial execution streams. In this paper, we introduce a novel Real-Time Adaptive Multi-Stream GPU ANNS System (RTAMS-GANNS). Our architecture achieves its objectives through three key advancements: 1) We first examine the real-time insertion mechanisms in existing GPU ANNS systems and find that they rely on repeated copying and memory allocation, which significantly hinders real-time effectiveness on GPUs. As a solution, we introduce a dynamic vector insertion algorithm based on memory blocks, which includes in-place rearrangement. 2) To enable real-time vector insertion in parallel, we introduce a multi-stream parallel execution mode, in contrast to existing systems that operate serially within a single stream. Our system uses a dynamic resource pool, allowing multiple streams to execute concurrently without additional execution blocking. 3) In extensive experiments and comparisons, our approach effectively handles varying QPS levels across different datasets, reducing latency by up to 40%-80%. The proposed system has also been deployed in real-world industrial search and recommendation systems, serving hundreds of millions of users daily, where it has achieved good results.
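
The dynamic resource pool mentioned in the abstract can be illustrated with a small CPU-side analogy. This is only a sketch under stated assumptions, not the paper's implementation: Python threads stand in for GPU streams, and the names `DynamicResourcePool`, `handle`, and the `scratch` buffer are hypothetical. The assumption made here is that "dynamic" means the pool grows when no resource is free, so a request never waits for another request to finish.

```python
import itertools
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class DynamicResourcePool:
    """Hands out per-stream resources; grows instead of making callers wait.

    Hypothetical analogy: each resource stands in for a GPU stream plus
    its scratch buffers. Nothing here touches a real GPU.
    """

    def __init__(self, initial: int) -> None:
        self._free = queue.Queue()
        self._ids = itertools.count()
        self._lock = threading.Lock()
        for _ in range(initial):
            self._free.put(self._new_resource())

    def _new_resource(self) -> dict:
        with self._lock:  # keep stream ids unique across threads
            stream_id = next(self._ids)
        return {"stream_id": stream_id, "scratch": bytearray(1024)}

    def acquire(self) -> dict:
        try:
            return self._free.get_nowait()
        except queue.Empty:
            # Assumed growth policy: create a resource rather than block.
            return self._new_resource()

    def release(self, res: dict) -> None:
        self._free.put(res)


def handle(pool: DynamicResourcePool, kind: str, payload: int):
    """Run one request (search or insert) on whatever resource is free."""
    res = pool.acquire()
    try:
        # A real system would launch kernels on the stream here.
        return kind, payload, res["stream_id"]
    finally:
        pool.release(res)


pool = DynamicResourcePool(initial=2)
requests = [("insert", i) if i % 3 == 0 else ("search", i) for i in range(12)]
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(lambda r: handle(pool, *r), requests))
```

Searches and insertions share the same pool in this sketch, so an insertion does not have to wait behind a queue of searches on a single serial stream. How the actual system sizes, grows, or shrinks its pool is not stated in the abstract.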

Paper Structure

This paper contains 13 sections, 4 equations, 4 figures, 1 table, 4 algorithms.

Figures (4)

  • Figure 1: (a) Inverted ID List and Vector Arrangement in Faiss/Raft: new vectors are appended by allocating new memory space, and the old memory space is freed once the new one is ready. (b) Inverted ID List and Vector Arrangement in Proposed Method: a memory block is allocated when new vectors need to be inserted. Each memory block has a header indicating its previous and next blocks. In this example, the IDs 0, 2, 6, 7, 9, and 10 can be connected as a single ID list. (c) Inverted ID List and Vector Arrangement after In-place Rearrangement: if the memory block list satisfies Exceed($m'$), fragmented memory blocks are dynamically rearranged, using temporary segments in the process. In this example, the IDs 4, 5, 8, and "pad" are aggregated, reducing the number of header jumps from two to one.
  • Figure 2: (a) Single-Stream Serial Mode (Faiss/Raft): kernels execute in a single stream, which cannot handle scenarios with newly arriving vectors. (b) Multi-Stream Parallel Mode (Proposed Method): kernels execute in parallel, which not only improves online GPU utilization but also accommodates real-time vector operations naturally.
  • Figure 3: Latency Comparison on SIFT1M and DSSMRT40M under QPS$_{search}$ = 1000, 5000, and 10000
  • Figure 4: Latency and Memory Change with Memory Block Size
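
The block-chained layout described in the Figure 1(b) caption can be sketched on the CPU as follows. This is a minimal illustration, not the paper's GPU algorithm: the names `Block`, `BlockPool`, and `BLOCK_CAPACITY` are hypothetical, only vector IDs are stored (no vector data), and the in-place rearrangement of Figure 1(c) is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

BLOCK_CAPACITY = 4  # hypothetical number of ID slots per memory block


@dataclass
class Block:
    """One fixed-size block; prev/next play the role of the block header."""
    ids: List[int] = field(default_factory=list)
    prev: Optional[int] = None  # pool index of the previous block
    next: Optional[int] = None  # pool index of the next block


class BlockPool:
    """A shared pool of blocks; each inverted list is a chain of blocks."""

    def __init__(self) -> None:
        self.blocks: List[Block] = []
        self.head: Dict[int, int] = {}  # list_id -> first block index
        self.tail: Dict[int, int] = {}  # list_id -> last block index

    def insert(self, list_id: int, vec_id: int) -> None:
        t = self.tail.get(list_id)
        if t is None or len(self.blocks[t].ids) == BLOCK_CAPACITY:
            # Tail missing or full: link one new block instead of
            # reallocating and copying the whole list.
            self.blocks.append(Block(prev=t))
            new = len(self.blocks) - 1
            if t is None:
                self.head[list_id] = new
            else:
                self.blocks[t].next = new
            self.tail[list_id] = new
            t = new
        self.blocks[t].ids.append(vec_id)

    def chain(self, list_id: int) -> List[int]:
        """Block indices of one inverted list, in header order."""
        out, b = [], self.head.get(list_id)
        while b is not None:
            out.append(b)
            b = self.blocks[b].next
        return out

    def scan(self, list_id: int) -> List[int]:
        """All IDs of one inverted list, following the block headers."""
        return [i for b in self.chain(list_id) for i in self.blocks[b].ids]


# Interleaved inserts into two lists leave each chain fragmented in the pool.
pool = BlockPool()
for i in range(10):
    pool.insert(i % 2, i)
```

After the interleaved inserts, the chain of list 0 occupies non-adjacent pool indices. This is the kind of fragmentation that, according to the caption of Figure 1(c), the rearrangement step reduces by aggregating blocks.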