
Interpreting the Curse of Dimensionality from Distance Concentration and Manifold Effect

Dehua Peng, Zhipeng Gui, Huayi Wu

TL;DR

The paper analyzes the curse of dimensionality through two causes: distance concentration and the manifold effect. It provides theoretical results showing that standard distance measures lose their contrast as the dimension $d$ grows, e.g., the relative distance ratio satisfies $\lim_{d\to\infty} RDR = 0$, and likewise $\lim_{d\to\infty} CCR = 0$, and corroborates these findings with empirical simulations and real-world data. The results highlight that high-dimensional data exhibit high-dimension, low-sample-size (HDLSS) characteristics, where distances lose discriminative power and most variance concentrates in a few principal components. The work argues for dimension reduction and feature selection (e.g., PCA, t-SNE, UMAP) as practical remedies in high-dimensional learning and clustering scenarios.

Abstract

The characteristics of data, such as distribution and heterogeneity, become more complex and counterintuitive as dimensionality increases. This phenomenon is known as the curse of dimensionality, where common patterns and relationships (e.g., internal pattern and boundary pattern) that hold in low-dimensional space may be invalid in higher-dimensional space. It degrades the performance of regression, classification, and clustering models and algorithms. The curse of dimensionality can be attributed to many causes. In this paper, we first summarize the potential challenges associated with manipulating high-dimensional data and explain the possible causes for the failure of regression, classification, or clustering tasks. Subsequently, we delve into two major causes of the curse of dimensionality, distance concentration and the manifold effect, by performing theoretical and empirical analyses. The results demonstrate that, as the dimensionality increases, nearest neighbor search (NNS) using three classical distance measurements, Minkowski distance, Chebyshev distance, and cosine distance, becomes meaningless. Meanwhile, the data incorporates more redundant features, and the variance contribution of principal component analysis (PCA) is skewed towards a few dimensions.
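The distance-concentration claim can be illustrated with a small simulation. The sketch below is not the authors' code: it draws points uniformly from the unit hypercube (an assumption) and reports the commonly used relative contrast $(D_{\max} - D_{\min}) / D_{\min}$ between a query and its farthest and nearest neighbors, which may differ from the paper's exact RDR definition. The function names `minkowski` and `relative_contrast` are hypothetical.

```python
import random


def minkowski(u, v, k):
    """Minkowski distance of order k between two equal-length sequences."""
    return sum(abs(a - b) ** k for a, b in zip(u, v)) ** (1.0 / k)


def relative_contrast(n_points, dim, k=2, seed=0):
    """Return (D_max - D_min) / D_min for one random query point.

    Assumption: the query and the n_points samples are drawn uniformly
    from the unit hypercube [0, 1]^dim; the paper may use other data.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        minkowski(query, [rng.random() for _ in range(dim)], k)
        for _ in range(n_points)
    ]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast is expected to shrink as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(100, dim), 3))
```

Under these assumptions the printed contrast should fall as `dim` increases, meaning the nearest and farthest neighbors become nearly equidistant from the query, which is why nearest neighbor search loses its discriminative power.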

Paper Structure

This paper contains 11 sections, 59 equations, and 10 figures.

Figures (10)

  • Figure 1: An example of ten data samples for illustrating the data sparsity in high-dimensional space. (a) Data distributions of the samples in 1-D to 3-D feature space. (b) The trend of sample density as the dimension increases.
  • Figure 2: Manifold structure has an undesirable effect on boundary-seeking clustering algorithms. (a) Illustration of the CDC algorithm. (b) Boundary-based constraint cannot prevent cross-cluster connections in manifolds.
  • Figure 3: Illustration for proving Lemma 1.
  • Figure 4: Relative distance ratio of the Minkowski distance with 100 points and different norms. (a) Trends of the lower bound in Eq. \ref{eq28} and (b) the simulation results under different dimensions. (c) The bounds in Eq. \ref{eq28} and simulated results under $k=1$.
  • Figure 5: Relative distance ratio of the Minkowski distance using different numbers of points with a norm of (a) $k=1$, (b) $k=2$, and (c) $k=3$, respectively.
  • ...and 5 more figures