Neighborhood Stability as a Measure of Nearest Neighbor Searchability

Thomas Vecchiato, Sebastian Bruch

TL;DR

The paper presents two measures for flat clusterings of high-dimensional points in Euclidean space: clustering-NSM, an internal measure of clustering quality, and point-NSM, a measure of clusterability (a function of the dataset itself) that is predictive of clustering-NSM.

Abstract

Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions, and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools to determine the suitability of clustering-based ANNS for a given dataset -- what we call "searchability." To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. The first is the Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality -- a function of a clustering of a dataset -- that we show to be predictive of ANNS accuracy. The second, the Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability -- a function of the dataset itself -- that is predictive of clustering-NSM. The two together allow us to determine whether a dataset is searchable by clustering-based ANNS given only the data points. Importantly, both are functions of nearest neighbor relationships between points, not distances, making them applicable to various distance functions including inner product.
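
To make the search procedure described in the abstract concrete, here is a minimal sketch of clustering-based ANNS in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the partitioner is plain Lloyd's k-means, the distance is Euclidean, and the names `kmeans`, `ann_search`, `exact_search`, and `nprobe` are hypothetical choices made for this example.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centroids):
    """Index of the closest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]


def kmeans(points, num_clusters, iterations=10, seed=0):
    """Plain Lloyd's algorithm (an assumed partitioner); returns (centroids, assignment)."""
    centroids = random.Random(seed).sample(points, num_clusters)
    for _ in range(iterations):
        assignment = assign(points, centroids)
        for c in range(num_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assign(points, centroids)


def ann_search(query, points, centroids, assignment, k, nprobe):
    """Search only the nprobe partitions whose centroids are closest to the query."""
    by_centroid = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    probed = set(by_centroid[:nprobe])
    candidates = [i for i, a in enumerate(assignment) if a in probed]
    return sorted(candidates, key=lambda i: dist(query, points[i]))[:k]


def exact_search(query, points, k):
    """Brute-force baseline over every point."""
    return sorted(range(len(points)), key=lambda i: dist(query, points[i]))[:k]


# Hypothetical toy data: 200 Gaussian points in the plane.
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
centroids, assignment = kmeans(points, num_clusters=8)

query = (0.3, -0.2)
truth = set(exact_search(query, points, k=10))
for nprobe in (1, 2, 8):
    found = set(ann_search(query, points, centroids, assignment, k=10, nprobe=nprobe))
    # Fraction of the true top-10 neighbors recovered when probing nprobe partitions.
    print(nprobe, len(found & truth) / len(truth))
```

When `nprobe` equals the number of partitions, every point is a candidate and the search reduces to the exact baseline. Smaller values of `nprobe` trade accuracy for speed, and how much accuracy is lost depends on the clustering.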

Paper Structure (23 sections, 3 theorems, 9 equations, 10 figures, 3 tables)

Key Result

Theorem 1

For a clustering $C$ of $(\mathcal{X}, \delta)$ and a fixed set of weights $\omega$, $\text{clustering-NSM}(C; \omega)$ satisfies the four axioms of clustering quality measures: consistency, richness, scale invariance, and isomorphism invariance. As such, $\text{clustering-NSM}(C; \omega)$ is a measure of clustering quality.

Figures (10)

  • Figure 1: Spearman's correlation coefficient between clustering quality and top-$k$ accuracy. For DB, we present the negated coefficient because smaller values indicate better clustering quality.
  • Figure 2: Spearman's correlation coefficient between clustering quality and external evaluation metrics. For DB, the coefficient is negated because a smaller index indicates better clustering quality.
  • Figure 3: Point-NSM distributions for various datasets.
  • Figure 4: Spearman's correlation coefficient between clustering quality and top-$k$ ANN accuracy where the number of clusters is $\frac{1}{4} \sqrt{\lvert \mathcal{X} \rvert}$. For DB, we present the negated coefficient because smaller values indicate better clustering quality.
  • Figure 5: Spearman's correlation coefficient between clustering quality and top-$k$ ANN accuracy where the number of clusters is $\frac{1}{2} \sqrt{\lvert \mathcal{X} \rvert}$. For DB, we present the negated coefficient because smaller values indicate better clustering quality.
  • ...and 5 more figures
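
The captions above report Spearman's correlation coefficient and negate it for a measure where smaller values are better. As a rough illustration of that computation, the sketch below implements Spearman's coefficient as the Pearson correlation of average ranks. The function names and the numbers are hypothetical and are not taken from the paper.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for five clusterings of one dataset.
accuracy = [0.61, 0.70, 0.74, 0.83, 0.90]          # top-k accuracy
larger_is_better = [0.20, 0.35, 0.33, 0.50, 0.62]  # some quality measure
smaller_is_better = [1.9, 1.6, 1.7, 1.2, 0.8]      # a lower-is-better index

print(spearman(larger_is_better, accuracy))
# Negate so that a positive number always means "better quality, higher accuracy".
print(-spearman(smaller_is_better, accuracy))
```

Negating the coefficient for a lower-is-better index puts all measures on a common footing, so they can be compared in one plot.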

Theorems & Definitions (9)

  • Definition 1: $\alpha$-Stability and set-NSM
  • Definition 2: clustering-NSM
  • Definition 3: point-NSM
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof