Statistical Inference for Manifold Similarity and Alignability across Noisy High-Dimensional Datasets
Hongrui Chen, Rong Ma
TL;DR
This work develops Manifold Spectrometrics (MS), a framework for statistically testing and quantifying similarity between high-dimensional datasets whose signals lie on latent manifolds and are observed under heteroskedastic noise. The core idea is to link manifold geometry to observable spectral properties through nMSD, a scale-invariant distance computed from the leading population variances. A two-stage estimator first denoises the data via a low-rank fit and Potts segmentation, then performs inference using random matrix theory, including outlier-eigenvalue CLTs and secular equations. The framework provides a consistent estimator of nMSD, a Wald-type test for manifold alignability, and kernel-based extensions, with rigorous asymptotic guarantees and practical plug-in variance estimators. Simulations and multi-sample single-cell analyses demonstrate robust performance, higher statistical power than existing methods, and the ability to detect condition-specific structure after removing noise heterogeneity, which facilitates principled comparison and alignment of complex, noisy datasets in biology and beyond.
Abstract
The rapid growth of high-dimensional datasets across various scientific domains has created a pressing need for new statistical methods to compare distributions supported on their underlying structures. Assessing similarity between datasets whose samples lie on low-dimensional manifolds requires robust techniques capable of separating meaningful signal from noise. We propose a principled framework for statistical inference of similarity and alignment between distributions supported on manifolds underlying high-dimensional datasets in the presence of heterogeneous noise. The key idea is to link the low-rank structure of observed data matrices to their underlying manifold geometry. By analyzing the spectrum of the sample covariance under a manifold signal-plus-noise model, we develop a scale-invariant distance measure between datasets based on their principal variance structures. We further introduce a consistent estimator for this distance and a statistical test for manifold alignability, and establish their asymptotic properties using random matrix theory. The proposed framework accommodates heterogeneous noise across datasets and offers an efficient, theoretically grounded approach for comparing high-dimensional datasets with low-dimensional manifold structures. Through extensive simulations and analyses of multi-sample single-cell datasets, we demonstrate that our method achieves superior robustness and statistical power compared with existing approaches.
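
To make the idea of a scale-invariant comparison of principal variance structures concrete, the following is a minimal illustrative sketch and not the paper's nMSD estimator. It assumes a simple stand-in distance: the Euclidean distance between the top-r sample covariance eigenvalues of each dataset after normalizing them to sum to one. The function names, the choice of r, and the toy circle and ellipse data are all hypothetical, and the sketch uses raw sample eigenvalues with none of the denoising or random-matrix corrections described above.

```python
import numpy as np

def leading_variance_profile(X, r):
    """Top-r eigenvalues of the sample covariance of X (n samples x p features),
    normalized to sum to 1 so that rescaling X does not change the profile.
    Hypothetical helper; not the paper's estimator."""
    Xc = X - X.mean(axis=0, keepdims=True)       # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    eigvals = s[:r] ** 2 / (X.shape[0] - 1)      # leading covariance eigenvalues
    return eigvals / eigvals.sum()

def spectral_profile_distance(X, Y, r):
    """Euclidean distance between normalized leading-variance profiles.
    An assumed stand-in for a scale-invariant spectral distance."""
    diff = leading_variance_profile(X, r) - leading_variance_profile(Y, r)
    return float(np.linalg.norm(diff))

# Toy data: a noisy circle and a noisy ellipse embedded in p dimensions.
rng = np.random.default_rng(0)
n, p, r = 500, 200, 2
t = rng.uniform(0.0, 2.0 * np.pi, n)

circle = np.zeros((n, p))
circle[:, 0] = 5.0 * np.cos(t)
circle[:, 1] = 5.0 * np.sin(t)

ellipse = np.zeros((n, p))
ellipse[:, 0] = 10.0 * np.cos(t)
ellipse[:, 1] = 2.0 * np.sin(t)

X = circle + rng.normal(scale=0.1, size=(n, p))
Y = 3.0 * X                                      # same dataset, globally rescaled
Z = ellipse + rng.normal(scale=0.1, size=(n, p))

d_rescaled = spectral_profile_distance(X, Y, r)  # near zero: scale-invariant
d_ellipse = spectral_profile_distance(X, Z, r)   # clearly positive: different shape
print(d_rescaled, d_ellipse)
```

In this toy setting the noise is small and homogeneous, so the raw sample eigenvalues are a reasonable proxy for the signal variances. Under stronger or heterogeneous noise the sample eigenvalues are biased, which is the situation the framework's denoising and random-matrix-based inference are designed to handle.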
