Let them have CAKES: A Cutting-Edge Algorithm for Scalable, Efficient, and Exact Search on Big Data

Morgan E. Prior, Thomas J. Howard, Oliver McLaughlin, Terrence Ferguson, Najib Ishaq, Noah M. Daniels

TL;DR

CAKES is a highly efficient and scalable algorithm for exact $k$-NN search on Big Data. We demonstrate that CAKES exhibits near-constant scaling with cardinality on data conforming to the manifold hypothesis and has perfect recall on data in \textit{metric} spaces.

Abstract

The ongoing Big Data explosion has created a demand for efficient and scalable algorithms for similarity search. Most recent work has focused on \textit{approximate} $k$-NN search, and while this may be sufficient for some applications, \textit{exact} $k$-NN search would be ideal for many applications. We present CAKES, a set of three novel, exact algorithms for $k$-NN search. CAKES's algorithms are generic over \textit{any} distance function, and they \textit{do not} scale with the cardinality or embedding dimension of the dataset, but rather with its metric entropy and fractal dimension. We test these claims on datasets from the ANN-Benchmarks suite under commonly-used distance functions, as well as on a genomic dataset with Levenshtein distance and a radio-frequency dataset with Dynamic Time Warping distance. We demonstrate that CAKES exhibits near-constant scaling with cardinality on data conforming to the manifold hypothesis, and has perfect recall on data in \textit{metric} spaces. We also demonstrate that CAKES exhibits significantly higher recall than state-of-the-art $k$-NN search algorithms when the distance function is not a metric. Additionally, we show that indexing and tuning time for CAKES is an order of magnitude, or more, faster than state-of-the-art approaches. We conclude that CAKES is a highly efficient and scalable algorithm for exact $k$-NN search on Big Data. We provide a Rust implementation of CAKES under an MIT license at https://github.com/URI-ABD/clam

Paper Structure

This paper contains 32 sections, 2 theorems, 8 equations, 4 figures, 5 tables, 6 algorithms.

Key Result

Theorem 1

Let $X$ be a dataset and $q$ a query sampled from the same distribution (i.e., arising from the same generative process) as $X$. Then the time complexity of performing Repeated $\rho$-NN search on $X$ with query $q$ is expressed in terms of the following quantities: $\mathcal{N}_{\hat{r}}(X)$, the metric entropy of the dataset; $d$, the local fractal dimension (LFD) of the dataset; and $k$, the number of nearest neighbors.

Figures (4)

  • Figure 1: $\delta$, $\delta^{+}$, and $\delta^{-}$ for a cluster $C$ with center $c$ and radius $r$, and a query $q$. ${\color{blue}\delta} = f(q, c)$ is the distance from the query to the cluster center $c$. ${\color{red}\delta^{+}} = \delta + r$ is the distance from the query to the theoretically farthest point in $C$. ${\color{green}\delta^{-}} = \max(0, \delta - r)$ is the distance from the query to the theoretically closest point in $C$.
  • Figure 2: Local fractal dimension (LFD) vs. cluster depth across six datasets. The 'random' dataset is randomly generated according to the procedure in Section \ref{sec:datasets-and-benchmarks:random-datasets}; note that the y-axis is different for this dataset. In each plot, the horizontal axis denotes depth in the cluster tree, and the vertical axis denotes the LFD of clusters at that depth. We show lines for the 5$^{th}$, 25$^{th}$, 50$^{th}$, 75$^{th}$ and 95$^{th}$ percentiles of LFD, as well as the minimum and maximum LFD at each depth. So that plots best reflect the distribution of LFDs across the entire dataset, we count each cluster as many times as its cardinality. For example, if, for some dataset, the 95$^{th}$ percentile of LFD at depth 40 is 3, this means that 95% of the points in clusters at depth 40 belong to a cluster whose LFD is at most 3.
  • Figure 3: Throughput across six datasets, including a randomly-generated dataset. In each plot, the horizontal axis represents increasing cardinality of the dataset, while the vertical axis represents the throughput in queries per second (higher is better). For Fashion-MNIST, Glove-25, and Sift, the CAKES algorithms become faster than linear search as cardinality increases, but the cardinality at which this occurs differs by dataset. For Fashion-MNIST and Glove-25, Depth-First Sieve is consistently fastest, while for Sift, Repeated $\rho$-NN is the fastest for smaller cardinalities and Depth-First Sieve is the fastest for larger cardinalities. With Silva, we observe that for all algorithms, throughput initially seems to decrease linearly as cardinality increases, but that it starts to level off at higher cardinalities. Depth-First Sieve is consistently the fastest algorithm on the Silva dataset. For Radio-ML and Random, we see that all three CAKES algorithms are slower than naïve linear search, and that their throughput decreases linearly with cardinality. HNSW and ANNOY are the fastest algorithms on all four datasets we benchmarked them on, but their recall degrades quickly as cardinality increases on all datasets. In particular, on the Random dataset, HNSW and ANNOY have near-zero recall.
  • Figure 4: Number of distance computations across four clustering strategies and three search algorithms on the Fashion-MNIST dataset. Adding the instrumentation to count the number of distance computations had the side-effect of significantly lowering throughput compared to that reported in Figure \ref{fig:results:scaling-plots}. The left column shows the throughput in queries per second, while the right column shows the mean number of distance computations per query. The x-axis represents increasing cardinality of the dataset.
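
The captions for Figures 1 and 2 each describe a small computation, and a short sketch may make them concrete. The Python below is illustrative only and is not taken from the CAKES Rust implementation: the function names `distance_bounds`, `euclidean`, and `weighted_percentile` are hypothetical, Euclidean distance stands in for the generic distance function $f$, and the percentile rule (the smallest LFD such that at least the requested share of points lies in clusters with LFD at or below it) is one assumed reading of the cardinality weighting described for Figure 2.

```python
import math


def euclidean(a, b):
    """Stand-in for the generic distance function f (an assumption)."""
    return math.dist(a, b)


def distance_bounds(query, center, radius, metric):
    """Return (delta, delta_plus, delta_minus) as defined in Figure 1.

    delta       = f(q, c): distance from the query to the cluster center
    delta_plus  = delta + r: distance to the theoretically farthest point
    delta_minus = max(0, delta - r): distance to the theoretically
                  closest point (zero when the query lies inside the cluster)
    """
    delta = metric(query, center)
    delta_plus = delta + radius
    delta_minus = max(0.0, delta - radius)
    return delta, delta_plus, delta_minus


def weighted_percentile(lfds, cardinalities, pct):
    """Percentile of LFD where each cluster counts once per point it holds.

    Returns the smallest LFD value such that at least pct percent of all
    points belong to clusters whose LFD is at most that value.
    """
    pairs = sorted(zip(lfds, cardinalities))
    total = sum(cardinalities)
    running = 0
    for lfd, cardinality in pairs:
        running += cardinality
        # Compare as running / total >= pct / 100 without dividing.
        if running * 100 >= pct * total:
            return lfd
    return pairs[-1][0]


if __name__ == "__main__":
    # Query at the origin, cluster centered at (3, 4) with radius 2.
    print(distance_bounds((0.0, 0.0), (3.0, 4.0), 2.0, euclidean))
    # Three clusters at one depth holding 50, 45, and 5 points.
    print(weighted_percentile([1.0, 3.0, 8.0], [50, 45, 5], 95))
```

In the second call, 95 of the 100 points sit in clusters with LFD at most 3, so the weighted 95th percentile is 3 even though one small cluster has LFD 8. This mirrors the worked example in the Figure 2 caption.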

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof