A Universal Nearest-Neighbor Estimator for Intrinsic Dimensionality

Eng-Jon Ong; Omer Bobrowski; Gesine Reinert; Primoz Skraba

TL;DR

This paper introduces a novel intrinsic dimensionality (ID) estimator based on nearest-neighbor distance ratios. The estimator involves only simple calculations and achieves state-of-the-art results. A theoretical analysis proves that it is universal: it converges to the true ID regardless of the distribution generating the data.

Abstract

Estimating the intrinsic dimensionality (ID) of data is a fundamental problem in machine learning and computer vision, providing insight into the true degrees of freedom underlying high-dimensional observations. Existing methods often rely on geometric or distributional assumptions and can fail badly when these assumptions are violated. In this paper, we introduce a novel ID estimator based on nearest-neighbor distance ratios that involves simple calculations and achieves state-of-the-art results. Most importantly, we provide a theoretical analysis proving that our estimator is \emph{universal}: it converges to the true ID regardless of the distribution generating the data. We present experimental results on benchmark manifolds and real-world datasets to demonstrate the performance of our estimator.
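To make the idea of a nearest-neighbor distance-ratio estimator concrete, the sketch below implements the maximum-likelihood form of TwoNN, one of the baselines named in the figure captions. It is not the L2N2 estimator introduced in this paper, whose formula is not reproduced on this page. The function name `twonn_mle`, the synthetic data, and the sample size are assumptions chosen purely for illustration.

```python
import numpy as np


def twonn_mle(points):
    """TwoNN-style maximum-likelihood ID estimate (illustrative, not L2N2).

    Uses the ratio mu = r2 / r1 of each point's second and first
    nearest-neighbor distances; under locally uniform density, log(mu)
    is exponential with rate d, so the MLE is n / sum(log(mu)).
    """
    x = np.asarray(points, dtype=float)
    # Dense pairwise Euclidean distances; fine for a few hundred points.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt(np.einsum("ijk,ijk->ij", diff, diff))
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors
    # After partitioning at index 1, columns 0 and 1 hold the two smallest
    # distances in each row, in increasing order.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0  # drop exact duplicates, where the ratio is undefined
    log_mu = np.log(r2[keep] / r1[keep])
    return log_mu.size / log_mu.sum()


# Hypothetical test data: a 2-dimensional patch linearly embedded in 10 dimensions.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(600, 2))
basis = rng.normal(size=(2, 10))
points = latent @ basis
estimate = twonn_mle(points)
print(f"estimated ID: {estimate:.2f}")
```

The estimate should land near the latent dimension 2 rather than the ambient dimension 10. The dense distance matrix keeps the sketch dependency-free; a spatial index would be the usual choice for larger samples.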

Paper Structure (22 sections, 4 theorems, 37 equations, 9 figures, 7 tables)

This paper contains 22 sections, 4 theorems, 37 equations, 9 figures, 7 tables.

Introduction
Related Work
Context and Contributions
NN-Ratio Based Dimensionality Estimation
Theoretical Analysis
Proofs
Experimental Setup
Tuning the L2N2 Estimator Parameters
Evaluation Datasets
Benchmark Manifolds
Noisy Datasets
Real-World Datasets
Downstream Experiments
Experimental Results
Benchmark Manifold Results
...and 7 more sections

Key Result

Theorem 3.1

Under the assumptions above, as $n \rightarrow \infty$, where $C_{k,j}$ is independent of $d$ and $f$, and is given in eqn:Ckj.

Figures (9)

Figure 1: The (approximately) linear relationship between $\bar{L}_{k,j}$ and $\log(d)$ (for $(k,j) = (2,1)$, $(5,3)$, and $(8,4)$). Here we used 2,500 points, sampled from the $d$-dimensional Gaussian distribution. The error bars are computed from 1,000 independent trials.
Figure 2: (a) MPE scores of L2N2 with $(k,j)=(2,1)$ and $(8,4)$, TwoNN, and GriDE across the individual benchmark manifolds for 2,500 points. (b) Comparison of the MPE when dimensionality rounding is used. L2N2 without rounding is included for reference.
Figure 3: Comparison of estimated ID of $d$-dimensional spheres for different methods and dimensions with the ground truth shown.
Figure 4: Estimated ID for real datasets. (a) ISOMAP faces; (b) digit "1" in MNIST (other digits can be found in the supplemental material); (c) CIFAR-100; (d) Isolet dataset. Bars denote $\pm 1$ standard deviation.
Figure 5: Downstream experiment results: reconstruction errors (MSE) of autoencoders with bottleneck layers set to different values.
...and 4 more figures

Theorems & Definitions (10)

Remark 1
Theorem 3.1
Remark 2
Lemma 3.2
Lemma 3.3
Lemma 3.4
Proof of Theorem \ref{thm:univ_limit}
Proof of Lemma \ref{lem:L2_limit}
Proof of Lemma \ref{lem:bounded_limit}
Proof of Lemma \ref{lem:homogeneous}
