Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification

Antonio Di Noia, Iuri Macocco, Aldo Glielmo, Alessandro Laio, Antonietta Mira

TL;DR

The paper tackles the scale-dependence of intrinsic dimension (ID) estimation by introducing ABIDE, an adaptive, likelihood-based extension of the Binomial Intrinsic Dimension Estimator (BIDE). ABIDE simultaneously learns per-point optimal neighbourhoods where the data density is approximately constant and updates the ID estimate, enabling robust performance in noisy, high-dimensional settings. The authors provide theoretical guarantees (convergence, consistency, asymptotic normality) and demonstrate superior performance over fixed-scale NN methods on synthetic benchmarks and real data (images and molecular trajectories). This approach yields more reliable, scale-aware characterizations of data geometry, with broad implications for dimensionality reduction, clustering, and density estimation in complex datasets.

Abstract

The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound on the number of variables necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Typically, at small scales the ID is very large, because the data are affected by measurement errors. At large scales, the ID can also be erroneously large, due to the curvature and the topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. This protocol is based on imposing that, for distances smaller than the correct scale, the density of the data is constant. In the presented framework, estimating the density requires knowing the ID; therefore, this condition is imposed self-consistently. We illustrate the usefulness and robustness of this procedure with benchmarks on artificial and real-world datasets.
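
The noise-driven side of this scale dependence can be illustrated with a minimal sketch. It is not the authors' implementation and it is not ABIDE: it is a simplified fixed-radius binomial estimate, assuming that the number of neighbours within a radius $t_A$, given the number within a larger radius $t_B$, is binomial with success probability $(t_A/t_B)^d$ when the density is constant. The function names `pairwise_distances` and `binomial_id`, the ratio $t_A/t_B = 0.5$, and the toy dataset (a noisy 2-dimensional square embedded in 3 dimensions) are illustrative assumptions.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix; the diagonal is set to inf to exclude self-pairs."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)  # clip tiny negative values caused by round-off
    dist = np.sqrt(d2)
    np.fill_diagonal(dist, np.inf)
    return dist


def binomial_id(dist, t_b, ratio=0.5):
    """Fixed-radius binomial ID estimate (hypothetical helper, simplified).

    Assumes k_A | k_B ~ Binomial(k_B, ratio**d), so that the pooled
    maximum-likelihood estimate of ratio**d is sum(k_A) / sum(k_B).
    """
    k_b = (dist < t_b).sum(axis=1)          # neighbours within the outer radius
    k_a = (dist < ratio * t_b).sum(axis=1)  # neighbours within the inner radius
    p_hat = k_a.sum() / k_b.sum()
    return np.log(p_hat) / np.log(ratio)


# Toy data (assumption): a uniform 2D square plus small Gaussian noise
# along a third, orthogonal coordinate.
rng = np.random.default_rng(0)
n = 2000
X = np.zeros((n, 3))
X[:, :2] = rng.uniform(0.0, 1.0, size=(n, 2))
X[:, 2] = rng.normal(0.0, 0.02, size=n)

dist = pairwise_distances(X)
id_small = binomial_id(dist, t_b=0.03)  # radius comparable to the noise amplitude
id_large = binomial_id(dist, t_b=0.2)   # radius well above the noise amplitude

print(f"ID at t_B = 0.03: {id_small:.2f}")
print(f"ID at t_B = 0.20: {id_large:.2f}")
```

On this toy dataset the estimate at the small radius should sit close to the embedding dimension 3, while the estimate at the larger radius should fall back towards the manifold dimension 2 (slightly lowered by boundary effects of the square). The sketch only shows why a fixed scale is problematic; choosing the scale per point and self-consistently is what the protocol in the abstract addresses.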

Paper Structure (18 sections, 6 theorems, 39 equations, 9 figures, 1 algorithm)


Key Result

lemma 1

Under Assumptions \ref{assump:data}-\ref{assump:manifold}, $k_i^*= O_P(\log n)$ for all $i=1,\dots,n$ as $n\to \infty$.
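
As a reminder, $O_P$ denotes standard stochastic boundedness. The definition below is stated for a generic random sequence; the symbols $X_n$, $a_n$, $M$ and $N$ are illustrative and do not come from the paper.

```latex
X_n = O_P(a_n)
\iff
\forall \varepsilon > 0 \;\; \exists\, M < \infty,\; N < \infty :
\quad
\Pr\!\left( \left| \frac{X_n}{a_n} \right| > M \right) < \varepsilon
\quad \text{for all } n > N .
```

With $X_n = k_i^*$ and $a_n = \log n$, the lemma states that the optimal neighbourhood sizes grow at most logarithmically with the sample size, in probability.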

Figures (9)

  • Figure 1: The ABIDE algorithm, thanks to its iterative nature, progressively adjusts the ID estimate and the size of the (approximately constant-density) neighbourhoods, which allows it to escape the noise scale and find the true ID of the underlying manifold. The first row shows how, over the iterations, the neighbourhoods of two selected points $i$ and $j$ grow as the ID estimate decreases from 2 to 1. The second row reports the evolution of the ID estimate, together with boxplots of the distribution of $k^*$ at each iteration.
  • Figure 2: Estimated ID as a function of embedding noise for two Gaussian datasets with different $d$, each with $n=5,000$ points. The shaded area is the Monte Carlo 99% confidence interval and the black dotted line is the true $d$.
  • Figure 3: Panel A: the original 2-dimensional distribution of the 20,000 artificial data points under examination (taken from d2021automatic). The embedding procedure used to obtain the actual dataset is described in the main text. Panel B: ID estimation using the standard non-adaptive Binomial Intrinsic Dimension Estimator (BIDE) at fixed radius or scale $t_B$ (blue circles), with associated p-values obtained through model validation (red squares). The isolated starred points represent, respectively, the ID estimate (blue) and the p-value (red) obtained with the ABIDE estimator. For reference, they are placed at the average (over all data points) distance of the $k_i^*$-th neighbour: $\frac{1}{n}\sum_{i=1}^n{t_{B,i}(k^*_i)}$. The dashed line (also shown in the other panels) is the theoretical ID=2 of the original dataset before adding noise. Details and considerations are given in the main text. Panel C: ID estimation and p-values using BIDE at fixed neighbourhood size $k$ (on the x-axis). Here too, the ABIDE results are reported as the starred values, placed at $\frac{1}{n}\sum_{i=1}^n{k^*_i}$. Panel D: points are coloured according to their $k^*_i$ value to visualize the concept of adaptive neighbourhood. Panel E: the evolution of the ID (blue) and the associated p-values (red) from model validation over successive iterations of ABIDE. The final values at convergence are starred and also reported in the BIDE plots for comparison. Panel F: 2NN vs ABIDE and p-values as a function of the amplitude of the noise used to embed the data. In Section \ref{sec:comparison}, we report the estimates obtained with the other most commonly used NN-based ID estimators.
  • Figure 4: In the two upper rows, we report ABIDE performance on the OptDigits dataset, made up of $n=3,823$ elements with embedding dimension $D=64$, together with a sample of 6 data points (first row). The first panel shows the evolution of the ID (blue) and of the p-value (red) with successive ABIDE iterations. In particular, one can appreciate how the ID estimated with ABIDE converges within 3 iterations and is significantly smaller than the 2NN estimate (from 10.43 to 8.05). The second and third panels show the scaling of the ID obtained with BIDE by fixing the radius $t_B$ or by fixing the number of neighbours $k$. Since no plateau is present in the scaling curves, without ABIDE we would not have a proper criterion to recognize a meaningful ID. Conversely, ABIDE (shown as starred points centred, respectively, at $\frac{1}{n}\sum_{i=1}^n{t_{B,i}(k^*_i)}\approx 3.5$ and $\frac{1}{n}\sum_{i=1}^n{k^*_i} \approx 9$) allows us to uniquely identify the point-dependent scale as the largest neighbourhood size $k^*_i$ where the local density is approximately constant. The last panel shows the distribution of $k^*$ computed with the ID estimated by 2NN and with ABIDE at convergence. In the two bottom rows, we report ABIDE performance on the MNIST dataset, with $n=70,000$ and $D=784$, together with a sample of 6 data points (top row). In the first panel, we observe how the ID drops from the initial value of 15.19 to 13.47. The second and third panels show the ID estimate at a fixed radius $t_B$ and at a fixed number of neighbours $k$, respectively. We notice that no plateaus are present and that p-values quickly become very low (in the case of fixed radius they are numerically 0). Conversely, ABIDE shows how the adaptive neighbourhood size allows for an ID estimate with a significantly larger p-value at an average scale $\frac{1}{n}\sum_{i=1}^n{t_{B,i}(k^*_i)}\approx 16$, corresponding to an average neighbourhood size of $\frac{1}{n}\sum_{i=1}^n{k^*_i}\approx 6.5$. Again, in the rightmost panel, one can observe the distributions of $k^*_i$ with 2NN and ABIDE.
  • Figure 5: ABIDE performance on the dihedral representation of the CLN025 peptide, a dataset with $n=3,758$ and $D=32$. The first row shows three typical configurations of the peptide (from left to right): folded beta-hairpin, misfolded twisted, and unfolded. From the first panel of the second row, we can appreciate how the ID drops by 18%, from the 2NN value of 6.78 to the ABIDE value of 5.55. BIDE estimates obtained at the average ABIDE scale are close to the ones obtained with ABIDE (second and third panels).
  • ...and 4 more figures

Theorems & Definitions (12)

  • lemma 1
  • proof
  • lemma 2
  • proof
  • proposition 1
  • proof
  • proposition 2
  • proof
  • proposition 3
  • proof
  • ...and 2 more