Quantum-inspired Benchmark for Estimating Intrinsic Dimension

Aritra Das, Joseph T. Iosue, Victor V. Albert

TL;DR

This work addresses the inconsistent intrinsic dimension (ID) estimates produced by existing IDE methods on real-world data by introducing QuIIEst, a quantum-inspired benchmark comprising infinite families of topologically non-trivial manifolds with known ground-truth IDs. Using Gilmore-Perelomov coherent-state embeddings, the authors generate homogeneous-space manifolds (e.g., Stiefel, Grassmannian, flag manifolds, and Pauli quotients) and even include non-manifold fractal examples like Hofstadter's butterfly to probe effective dimensionality. They evaluate six IDE methods across these manifolds, demonstrating that standard benchmarks often underrepresent difficulty and that embedding choices and distortions influence estimation accuracy; in particular, some manifolds are harder for IDEs than spheres with the same ground-truth dimension. The study further analyzes how data statistics and geometry relate to IDE performance and provides a scalable framework for future benchmarking, with plans to release datasets under CC BY 4.0 to advance reproducibility and comparative evaluation in intrinsic-dimension estimation.

Abstract

Machine learning models can generalize well on real-world datasets. According to the manifold hypothesis, this is possible because datasets lie on a latent manifold with small intrinsic dimension (ID). There exist many methods for ID estimation (IDE), but their estimates vary substantially. This warrants benchmarking IDE methods on manifolds that are more complex than those in existing benchmarks. We propose a Quantum-Inspired Intrinsic-dimension Estimation (QuIIEst) benchmark consisting of infinite families of topologically non-trivial manifolds with known ID. Our benchmark stems from a quantum-optical method of embedding arbitrary homogeneous spaces while allowing for curvature modification and additive noise. The IDE methods tested were generally less accurate on QuIIEst manifolds than on existing benchmarks under identical resource allocation. We also observe minimal performance degradation with increasingly non-uniform curvature, underscoring the benchmark's inherent difficulty. As a result of independent interest, we perform IDE on the fractal Hofstadter's butterfly and identify which methods are capable of extracting the effective dimension of a space that is not a manifold.
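To make the IDE task concrete, the following is a minimal sketch that estimates the ID of points sampled from a unit sphere, using the classic Levina-Bickel maximum-likelihood estimator. This is an illustration only: the helper names `sample_sphere` and `mle_intrinsic_dimension`, the neighborhood size `k=20`, and the brute-force neighbor search are assumptions made here, not the benchmark's implementation or configuration.

```python
import numpy as np


def sample_sphere(n_points, intrinsic_dim, rng):
    """Draw uniform samples from the unit sphere S^d embedded in R^(d+1)."""
    x = rng.standard_normal((n_points, intrinsic_dim + 1))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mle_intrinsic_dimension(x, k=20):
    """Levina-Bickel MLE of the ID, averaged over all points.

    Hypothetical helper for illustration; uses a brute-force
    O(n^2) neighbor search, so it only suits small datasets.
    """
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(x**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)  # clip tiny negative round-off
    dist = np.sqrt(d2)

    # Sort each row; column 0 is the point itself, so skip it.
    knn = np.sort(dist, axis=1)[:, 1 : k + 1]

    # Local estimate: inverse of the mean of log(T_k / T_j), j = 1..k-1.
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])
    local_estimates = (k - 1) / np.sum(log_ratios, axis=1)
    return float(np.mean(local_estimates))


rng = np.random.default_rng(0)
points = sample_sphere(2000, intrinsic_dim=2, rng=rng)
print(mle_intrinsic_dimension(points, k=20))
```

On a simple manifold such as this sphere, the estimate should land close to the true ID of 2; the benchmark's premise is that such agreement is harder to obtain on more complex manifolds.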

Paper Structure

This paper contains 61 sections, 28 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: While most methods perform well at intrinsic dimension estimation (IDE) for simplistic manifolds like (hyper-)spheres, there is wide variability in their estimates for real-world datasets like MNIST. We propose QuIIEst, a family of topologically non-trivial manifolds, to serve as an intermediate confidence evaluation for IDE. The QuIIEst dataset contains several different embeddings of infinite families of manifolds whose dimension is polynomial in their parameters, and which admit nontrivial geometry and topology.
  • Figure 2: Comparison of the quantity $\langle |\delta|\rangle_M / \langle |\delta|\rangle_S - 1$, where the numerator is the average of the absolute value of the relative error $|\delta|$ over all instantiations of a given manifold family $M$, while the denominator is the corresponding average over all sphere embeddings with the same intrinsic and ambient dimensions. This relative comparison shows that tested methods tend to perform much worse against our manifolds than against spheres with the same dimensions. Interestingly, we observe a positive score with a change in embedding of the Grassmannian from "Proj" to "Vec", hinting that method accuracy depends on the type of embedding. Due to high computational time, we choose a smaller range of hyper-parameter sweeps for DANCo. A 1-$\sigma$ sampling error is reported here after the $\pm$ sign.
  • Figure 3: Effect of squeezing: We plot the relative error $\langle|\delta|\rangle$ as a function of the parameter $\epsilon$, which is a direct measure of anisotropy. Except for St (Matrix), most methods show negligible change as $\epsilon$ is increased.
  • Figure 4: Effect of additive noise. We report IDE performance when the uncorrupted data $\mathbf{x}$ is perturbed as $\mathbf{x} \to \mathbf{x} + \epsilon$, where $\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{\Sigma})$ with $\mathbf{\Sigma}$ chosen to be proportional to the identity (isotropic), diagonal (uncorrelated), or a complete random symmetric matrix (anisotropic). We then plot the relative error $\delta$ as a function of the noise scale $\sigma^2$. Notice that there is no discernible change in behavior between the different noise types, except in the high-noise limit, where the dimension is consistently underestimated under the anisotropic noises. The figures shown here plot the absolute value of $\delta$, but we numerically confirmed that $\delta$ is smaller. A 1-$\sigma$ sampling error is plotted.
  • Figure 5: Effect of scaling with intrinsic dimension within the same family of manifolds. The relative error $\delta$ is plotted as a function of increasing $d_i$. Most manifolds show a transition from overestimation at small $d_i$ to underestimation at high $d_i$, corroborating earlier observations for other manifolds [Levina_MLE]. The Gr (Vec) embedding shows some minor differences over the same range of $d_i$, but is overall consistent.
  • ...and 14 more figures
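
The captions for Figures 2 and 4 describe two computations that a short sketch may help clarify: corrupting data with Gaussian noise of covariance $\sigma^2 \mathbf{\Sigma}$, and scoring a manifold family against spheres via $\langle |\delta|\rangle_M / \langle |\delta|\rangle_S - 1$. The code below is a sketch under stated assumptions, not the authors' implementation. In particular, the captions do not define $\delta$, so it is assumed here to be $(\hat{d} - d)/d$; the diagonal entries and the random symmetric matrix are drawn in an arbitrary way chosen for illustration (the latter as $A A^\top$ so that it is a valid covariance); and all function names are hypothetical.

```python
import numpy as np


def noise_covariance(dim, kind, rng):
    """Build a covariance matrix Sigma of the requested type (illustrative)."""
    if kind == "isotropic":
        return np.eye(dim)
    if kind == "diagonal":
        # Assumption: uncorrelated noise with arbitrary positive variances.
        return np.diag(rng.uniform(0.5, 1.5, size=dim))
    if kind == "anisotropic":
        # Assumption: random symmetric positive semi-definite matrix A A^T.
        a = rng.standard_normal((dim, dim))
        return a @ a.T / dim
    raise ValueError(f"unknown noise kind: {kind!r}")


def add_noise(x, sigma2, kind, rng):
    """Return x + eps with eps ~ N(0, sigma2 * Sigma), one draw per row."""
    dim = x.shape[1]
    cov = sigma2 * noise_covariance(dim, kind, rng)
    eps = rng.multivariate_normal(np.zeros(dim), cov, size=x.shape[0])
    return x + eps


def relative_error(d_est, d_true):
    """Assumed definition of delta: signed relative error of the estimate."""
    return (d_est - d_true) / d_true


def relative_score(deltas_manifold, deltas_sphere):
    """Compute <|delta|>_M / <|delta|>_S - 1 from two collections of deltas."""
    mean_m = np.mean(np.abs(deltas_manifold))
    mean_s = np.mean(np.abs(deltas_sphere))
    return mean_m / mean_s - 1.0
```

Under this definition, a positive score means the method's average error on the manifold family exceeds its average error on the matched spheres, and a score of zero means the two are equally hard for that method.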