CANDY: A Benchmark for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion

Xianzhi Zeng, Zhuoyan Wu, Xinjing Hu, Xuanhua Shi, Shixuan Sun, Shuhao Zhang

TL;DR

The paper addresses a gap in AKNN benchmarking, namely dynamic data ingestion and distribution drift, by introducing CANDY, a benchmark framework that evaluates continuous AKNN search over online data streams. It combines data handling, a categorization of algorithms into ranging-based and navigation-based methods, and optimization techniques (including ML-based hashing and distance computation strategies) to assess performance on real-world and drift-enabled synthetic workloads. Key findings show that simpler baselines can outperform more complex AKNN algorithms in dynamic settings, and that ingestion efficiency and distribution shifts are critical bottlenecks: optimizations offer gains but do not fully overcome them. The work provides a practical platform for evaluating AKNN in streaming environments, guiding future research toward robust, adaptable retrieval systems suitable for real-time data ecosystems such as RAG deployments.

Abstract

Approximate K Nearest Neighbor (AKNN) algorithms play a pivotal role in various AI applications, including information retrieval, computer vision, and natural language processing. Although numerous AKNN algorithms and benchmarks have been developed recently to evaluate their effectiveness, the dynamic nature of real-world data presents significant challenges that existing benchmarks fail to address. Traditional benchmarks primarily assess retrieval effectiveness in static contexts and often overlook update efficiency, which is crucial for handling continuous data ingestion. This limitation results in an incomplete assessment of an AKNN algorithm's ability to adapt to changing data patterns, thereby restricting insight into its performance in dynamic environments. To address these gaps, we introduce CANDY, a benchmark tailored for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion. CANDY comprehensively assesses a wide range of AKNN algorithms, integrating advanced optimizations such as machine learning-driven inference to supplant traditional heuristic scans, and improved distance computation methods to reduce computational overhead. Our extensive evaluations across diverse datasets demonstrate that simpler AKNN baselines often surpass more complex alternatives in terms of recall and latency. These findings challenge established beliefs about the necessity of algorithmic complexity for high performance. Furthermore, our results underscore existing challenges and illuminate future research opportunities. We have made the datasets and implementation methods available at: https://github.com/intellistream/candy.
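
To make the two quantities named in the abstract concrete (update efficiency under continuous ingestion, and recall), the following is a minimal sketch in plain Python. It is not CANDY's implementation or API: the names `BruteForceIndex`, `run_stream`, `recall_at_k`, and `random_vectors` are hypothetical, and the exact-scan index merely stands in for the kind of simple baseline the abstract mentions. The sketch loads initial data, ingests the remaining vectors in micro-batches while timing the ingestion, and then scores query results against exact ground truth computed over all data.

```python
import random
import time


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class BruteForceIndex:
    """Hypothetical exact baseline: keeps vectors in a list, scans all per query."""

    def __init__(self):
        self.vectors = []

    def insert_batch(self, batch):
        self.vectors.extend(batch)

    def search(self, query, k):
        # Return the positions of the k closest stored vectors.
        order = sorted(range(len(self.vectors)),
                       key=lambda i: sq_dist(self.vectors[i], query))
        return order[:k]


def recall_at_k(found, truth):
    """Fraction of the true k nearest neighbors that the index returned."""
    return len(set(found) & set(truth)) / len(truth)


def run_stream(index, initial, stream, batch_size, queries, k):
    """Load `initial`, ingest `stream` in micro-batches, then query.

    Returns (seconds spent ingesting the stream, mean recall@k).
    """
    index.insert_batch(initial)
    start = time.perf_counter()
    for i in range(0, len(stream), batch_size):
        index.insert_batch(stream[i:i + batch_size])
    ingest_seconds = time.perf_counter() - start

    # Ground truth: exact search over everything that should now be indexed.
    exact = BruteForceIndex()
    exact.insert_batch(initial + stream)
    recalls = [recall_at_k(index.search(q, k), exact.search(q, k))
               for q in queries]
    return ingest_seconds, sum(recalls) / len(recalls)


def random_vectors(n, dim):
    """Assumed synthetic data: n vectors drawn uniformly from [0, 1)^dim."""
    return [[random.random() for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    random.seed(0)
    seconds, recall = run_stream(BruteForceIndex(),
                                 random_vectors(200, 8),   # initial data
                                 random_vectors(800, 8),   # online stream
                                 batch_size=100,
                                 queries=random_vectors(10, 8),
                                 k=5)
    print(f"ingestion: {seconds:.4f}s, recall@5: {recall:.2f}")
```

Any object exposing the same `insert_batch` and `search` methods could be passed to `run_stream` in place of the exact baseline, which is the sense in which ingestion cost and recall can be compared across indexes under an identical stream.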

Paper Structure (33 sections, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Overview of the CANDY benchmark framework.
  • Figure 2: Tuning the position at which distribution drift occurs. The red dashed line marks the case where the drift begins exactly at the start of the online-ingested data.
  • Figure 3: Tuning the intensity of distribution drift. A larger contamination probability indicates higher intensity.
  • Figure 4: Breakdown of \ref{tab:algosList} on \ref{tab:realworld_workloads} and \ref{tab:realworld_workloads}.
  • Figure 5: Tuning the size of micro-batches from 200 to 50000.
  • ...and 3 more figures