Approximate Nearest Neighbour Search on Dynamic Datasets: An Investigation

Ben Harwood, Amir Dezfouli, Iadine Chades, Conrad Sanderson

TL;DR

This work tackles approximate nearest neighbour (ANN) search on dynamic datasets, where online additions and model updates require the ANN index to be updated or rebuilt. It empirically evaluates five popular methods (ScaNN, IVFPQ, k-d tree, ANNOY, HNSWG) on two dynamic datasets derived from SIFT1M and DEEP1B, explicitly accounting for update costs, event rates, and batch sizes. Key findings: parameters tuned for static search often perform poorly on dynamic datasets; k-d trees are slower than brute-force search; HNSWG provides a consistent speedup for online data collection; and ScaNN is faster than brute-force search for online feature learning at recall levels below about 75%. The results yield practical guidance for selecting and tuning ANN methods in dynamic settings and point to future directions such as update batching strategies, pruning, and modelling temporal dependencies to further improve performance.

Abstract

Approximate k-Nearest Neighbour (ANN) methods are often used for mining information and aiding machine learning on large scale high-dimensional datasets. ANN methods typically differ in the index structure used for accelerating searches, resulting in various recall/runtime trade-off points. For applications with static datasets, runtime constraints and dataset properties can be used to empirically select an ANN method with suitable operating characteristics. However, for applications with dynamic datasets, which are subject to frequent online changes (like addition of new samples), there is currently no consensus as to which ANN methods are most suitable. Traditional evaluation approaches do not consider the computational costs of updating the index structure, as well as the rate and size of index updates. To address this, we empirically evaluate 5 popular ANN methods on two main applications (online data collection and online feature learning) while taking into account these considerations. Two dynamic datasets are used, derived from the SIFT1M dataset with 1 million samples and the DEEP1B dataset with 1 billion samples. The results indicate that the often used k-d trees method is not suitable on dynamic datasets as it is slower than a straightforward baseline exhaustive search method. For online data collection, the Hierarchical Navigable Small World Graphs method achieves a consistent speedup over baseline across a wide range of recall rates. For online feature learning, the Scalable Nearest Neighbours method is faster than baseline for recall rates below 75%.
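The evaluation idea described in the abstract, timing index updates as well as searches, can be illustrated with a minimal sketch. This is not the authors' code: the class `BruteForceIndex`, the function `run_dynamic_eval`, and the `add`/`search` interface are hypothetical names chosen for illustration, and excluding the initial build from the timings is an assumption.

```python
import time
import numpy as np


class BruteForceIndex:
    """Exhaustive-search baseline (hypothetical): an update is just an append."""

    def __init__(self, dim):
        self.data = np.empty((0, dim), dtype=np.float32)

    def add(self, batch):
        # Append the new samples; there is no index structure to maintain.
        self.data = np.vstack([self.data, np.asarray(batch, dtype=np.float32)])

    def search(self, queries, k):
        # Squared Euclidean distance from every query to every stored sample.
        queries = np.asarray(queries, dtype=np.float32)
        diff = queries[:, None, :] - self.data[None, :, :]
        dist = np.einsum("qnd,qnd->qn", diff, diff)
        k = min(k, self.data.shape[0])
        return np.argsort(dist, axis=1)[:, :k]


def run_dynamic_eval(index, initial, events, queries_per_event, k):
    """Interleave update events with search batches and time both parts.

    Assumption: the initial build is not timed; only online events are.
    """
    index.add(initial)
    update_time = 0.0
    search_time = 0.0
    for event_batch, query_batch in zip(events, queries_per_event):
        t0 = time.perf_counter()
        index.add(event_batch)              # cost of updating the index
        update_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        index.search(query_batch, k)        # cost of answering the queries
        search_time += time.perf_counter() - t0
    return update_time, search_time
```

Any ANN index wrapped to expose the same `add` and `search` methods could be passed to `run_dynamic_eval`, so that its total time (updates plus searches) is compared with the baseline's under the same event rate and batch sizes.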

Paper Structure (8 sections, 4 figures)

Figures (4)

  • Figure 1: (a) Traditional ANN evaluation approaches use a single batch of searches performed on a static index. (b) To better reflect ANN use on dynamic datasets, a more thorough evaluation must take into account index updates, as well as the rate and size of the updates.
  • Figure 2: Speedup (logarithmic scale) over brute-force search as a function of average top-50 recall, shown on logarithmic scale. The speedup is the ratio of time taken by brute-force search to the time taken by a given ANN method. Results obtained on the Online Data Collection (ODC) dataset, starting with 100K samples, followed by 100K addition events, with event and search batch size of 1. (a) Parameters for each method were tuned following recommended practice for static search problems. (b) Parameters for each method were tuned via expanded parameter exploration in terms of range and resolution.
  • Figure 3: Speedup (logarithmic scale) over brute-force search as a function of average top-50 recall, shown on logarithmic scale. The speedup is the ratio of time taken by brute-force search to the time taken by a given ANN method. Results obtained on the Online Feature Learning (OFL) dataset, starting with 5K samples, followed by 100K update events, with event and search batch size of 200. (a) Parameters for each method were tuned following recommended practice for static search problems. (b) Parameters for each method were tuned via expanded parameter exploration in terms of range and resolution.
  • Figure 4: Speedup over brute-force search as a function of average top-50 recall, while varying the rate and batch sizes of update events. Results obtained using the HNSWG method [malkov2018hnsw] on the OFL dataset, starting with 5K samples, followed by 100K update events, using an initial event and search batch size of 200. For each subfigure, the axis limits were selected to zoom in on salient areas. (a) Increasing event rate with fixed search rate. (b) Increasing event batch size with fixed search batch size.
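
The two quantities plotted in Figures 2 to 4 can be sketched in a few lines. The captions define speedup as a ratio of times; the recall definition below (fraction of the true top-k neighbours that the ANN method returns, averaged over queries) is the commonly used one and is an assumption here, as the captions do not spell it out. The function names are hypothetical.

```python
import numpy as np


def average_recall_at_k(approx_ids, exact_ids, k=50):
    """Mean fraction of the true top-k neighbours found by the ANN method.

    approx_ids, exact_ids: one sequence of neighbour ids per query.
    """
    hits = [
        len(set(a[:k]) & set(e[:k]))   # overlap between returned and true ids
        for a, e in zip(approx_ids, exact_ids)
    ]
    return float(np.mean(hits)) / k


def speedup(brute_force_seconds, ann_seconds):
    """Ratio of brute-force time to ANN time; values above 1 favour the ANN method."""
    return brute_force_seconds / ann_seconds
```

Whether `ann_seconds` should include index update time as well as search time is left to the caller; including it would match the abstract's emphasis on accounting for update costs.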