On the Use of Bagging for Local Intrinsic Dimensionality Estimation

Kristóf Péter, Ricardo J. G. B. Campello, James Bailey, Michael E. Houle

Abstract

The theory of Local Intrinsic Dimensionality (LID) has become a valuable tool for characterizing local complexity within and across data manifolds, supporting a range of data mining and machine learning tasks. Accurate LID estimation requires samples drawn from small neighborhoods around each query to avoid biases from nonlocal effects and potential manifold mixing, yet limited data within such neighborhoods tends to cause high estimation variance. As a variance reduction strategy, we propose an ensemble approach that uses subbagging to preserve the local distribution of nearest neighbor (NN) distances. The main challenge is that the uniform reduction in total sample size within each subsample increases the proximity threshold for finding a fixed number k of NNs around the query. As a result, in the specific context of LID estimation, the sampling rate has an additional, complex interplay with the neighborhood size, where the two together determine the sample size as well as the locality and resolution considered for estimation. We analyze both theoretically and experimentally how the choice of the sampling rate and the k-NN size used for LID estimation, alongside the ensemble size, affects performance, enabling informed prior selection of these hyper-parameters depending on application-based preferences. Our results indicate that within broad and well-characterized regions of the hyper-parameter space, using a bagged estimator will most often significantly reduce variance as well as the mean squared error when compared to the corresponding non-bagged baseline, with controllable impact on bias. We additionally propose and evaluate different ways of combining bagging with neighborhood smoothing for substantial further improvements on LID estimation performance.
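
For concreteness, the following is a minimal sketch of the baseline $k$-NN maximum-likelihood (MLE) LID estimator referred to throughout, assuming the standard Hill-type form in which the estimate is computed from the distances to a query's $k$ nearest neighbors; the function name and the choice of Euclidean distance are illustrative rather than taken from the paper.

```python
import numpy as np

def mle_lid(query, data, k):
    """Hill-type MLE estimate of LID at `query` from its k-NN distances.

    `query` is a 1-D array of shape (d,); `data` is an (n, d) array of points.
    This sketches the non-bagged baseline estimator.
    """
    dists = np.sort(np.linalg.norm(data - query, axis=1))
    dists = dists[dists > 0]   # drop the query itself / exact duplicates
    knn = dists[:k]            # k smallest positive distances
    r_k = knn[-1]              # radius of the k-NN ball
    # LID_hat = -1 / mean_i log(r_i / r_k); the i = k term contributes 0.
    return -1.0 / np.mean(np.log(knn / r_k))
```

Small $k$ keeps the estimate local but leaves few distances to average over; this is the locality-versus-variance tension that the proposed bagging scheme is designed to ease.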

Paper Structure

This paper contains 61 sections, 3 theorems, 40 equations, 29 figures, 9 tables, and 1 algorithm.

Key Result

Theorem 1

Define $\gamma(h,m) \triangleq \mathrm{Cov}(\hat{\theta}_{m,i}, \hat{\theta}_{m,j} \;|\; |\Pi^{(i)}\cap\Pi^{(j)}|=h)$ for the integer domain given by $m = r n \in \mathbb{Z}^+$ and $h=0,1,\dots,m$, namely, the covariance of two single-bag estimators given that their bags $\Pi^{(i)}$ and $\Pi^{(j)}$ share exactly $h$ sample points.
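
One way to see how $\gamma(h,m)$ enters the ensemble analysis: under subbagging (Definition 1), each bag is drawn independently and uniformly without replacement, so the overlap between two bags is hypergeometrically distributed, and by symmetry the conditional mean of a single-bag estimator does not depend on the overlap. Under these assumptions, a sketch of the resulting decomposition is

$$
\Pr\bigl[\,|\Pi^{(i)}\cap\Pi^{(j)}| = h\,\bigr]
  \;=\; \frac{\binom{m}{h}\binom{n-m}{m-h}}{\binom{n}{m}},
\qquad
\mathrm{Cov}\bigl(\hat{\theta}_{m,i},\hat{\theta}_{m,j}\bigr)
  \;=\; \sum_{h=0}^{m} \Pr\bigl[\,|\Pi^{(i)}\cap\Pi^{(j)}| = h\,\bigr]\,\gamma(h,m),
$$

so that the variance of an ensemble average over $b$ bags combines as

$$
\mathrm{Var}\!\left(\frac{1}{b}\sum_{i=1}^{b}\hat{\theta}_{m,i}\right)
  \;=\; \frac{1}{b}\,\mathrm{Var}\bigl(\hat{\theta}_{m,1}\bigr)
  \;+\; \frac{b-1}{b}\,\mathrm{Cov}\bigl(\hat{\theta}_{m,i},\hat{\theta}_{m,j}\bigr).
$$

It is therefore the covariance between bags, rather than the single-bag variance alone, that governs the achievable variance reduction as the ensemble size grows.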

Figures (29)

  • Figure 1: The leftmost figure shows the Lollipop dataset [LIDL] overlaid with the cumulative distribution function (c.d.f.) of the induced empirical distance distribution from the query at $(2,2)$, shown as a heatmap. The second figure shows that, within close vicinity of the query $t$, the c.d.f. is proportional to the square of the distance, becoming linear beyond the bounds of the 2D (circle-plate) submanifold, once the 1D (line) submanifold is reached. The third figure shows a heatmap of MLE [extremevaluetheoretic] LID estimates at the query at $(0.5,0.005)$, as a function of the $k$-NN hyper-parameter, overlaid with the corresponding dataset distributed uniformly on a thin ribbon-like surface. The rightmost figure displays how LID estimates vary with $k$, affected not only by the interplay between locality preservation and sample size, but also by the resolution at which LID is measured in practice. Within close vicinity of the query, where neither locality nor resolution is critical, the small sample size results in high variance and, for estimators that are only asymptotically unbiased (including MLE), also in bias.
  • Figure 2: Flowchart illustrating bagged LID estimation on a hypothetical toy dataset. The algorithm is shown for a single query point (with index 7) from the dataset. Note that it suffices to sample the bags only once and reuse them across queries. The figure makes clear that LID estimation can be carried out independently for each query point and for each bag, so processing is highly parallelizable across multiple nodes; a minimal code sketch of this per-query, per-bag procedure is given after the figure list. For details on estimation across a dataset of queries, see the pseudo-code in the appendix (Bagging for LID Algorithm).
  • Figure 3: Comparison of the relative MSE achieved by three different LID estimators, MLE, TLE, and MADA, with and without smoothing, bagging, and three strategies for combining bagging with smoothing. The results are for $19$ datasets using case-by-case optimal $k$ and $r$ hyper-parameters. The min-max normalized MSE values are subtracted from $1$ before plotting, such that larger scores correspond to smaller relative MSEs.
  • Figure 4: MSE and its decomposition for each of the 19 datasets as a function of the sampling rate $r$ used for bagged MLE as the LID estimator. Note that the baseline MLE is equivalent to $r=1$, shown as the rightmost bar in each individual chart.
  • Figure 5: Heatmaps of the relative difference (log-ratio) between the MSE achieved by the baseline estimator (MLE) and the MSE of its bagged counterpart for each of the 19 datasets and cross-combinations of values for the sampling rate $r$ (x-axis) and the $k$-NN neighborhood size $k$ (y-axis). Positive values, color-coded in blue, indicate that bagging outperforms the baseline, whereas negative values, color-coded in red, indicate the opposite. The relation is symmetric around zero, and the darkness of the cells is proportional to the magnitude of the corresponding absolute value for the given dataset.
  • ...and 24 more figures
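
The per-query, per-bag procedure described in the Figure 2 caption can be sketched as follows, assuming subbagging (drawing a fraction $r$ of the data without replacement for each bag) and the `mle_lid` helper sketched after the abstract; all names and defaults are illustrative rather than taken from the paper.

```python
import numpy as np

def bagged_lid(queries, data, k, r, n_bags, seed=None):
    """Subbagged LID estimation across a set of queries (cf. Figure 2).

    Bags are sampled once and reused for every query; each query's LID is
    estimated within each bag from its k NNs there (using `mle_lid` from the
    earlier sketch), and the per-bag estimates are averaged.  The two loops
    below are independent per query and per bag, so they parallelize trivially.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    m = max(int(round(r * n)), k + 1)  # bag size m = r * n, at least k+1 points
    bags = [rng.choice(n, size=m, replace=False) for _ in range(n_bags)]

    estimates = np.empty((len(queries), n_bags))
    for qi, q in enumerate(queries):
        for bi, idx in enumerate(bags):
            estimates[qi, bi] = mle_lid(q, data[idx], k)
    return estimates.mean(axis=1)      # ensemble average per query
```

Reusing the same bags across queries keeps the sampling cost independent of the number of queries, while averaging over bags trades a modest, controllable impact on bias (each bag spans a somewhat wider neighborhood around the query) for the variance reduction analyzed above.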

Theorems & Definitions (8)

  • Definition 1: Subbagging
  • Theorem 1
  • Proof
  • Theorem 2
  • Proof
  • Proof
  • Proof
  • Theorem 3