Statistical Inference for Manifold Similarity and Alignability across Noisy High-Dimensional Datasets

Hongrui Chen, Rong Ma

TL;DR

This work develops Manifold Spectrometrics (MS), a framework for statistically testing and quantifying similarity between high-dimensional datasets whose signals lie on latent manifolds under heteroskedastic noise. The core idea links manifold geometry to observable spectral properties through a scale-invariant distance, nMSD, computed from the leading population variances. A two-stage estimator first denoises via a low-rank fit and Potts segmentation, then performs inference using random matrix theory, including outlier-eigenvalue CLTs and secular equations. The framework provides a consistent estimator of nMSD, a Wald-type test for manifold alignability, and kernel-based extensions, with rigorous asymptotic guarantees and practical plug-in variance estimators. Simulations and multi-sample single-cell analyses demonstrate robust performance, higher statistical power than existing methods, and the ability to detect condition-specific structure after removing noise heterogeneity, which facilitates principled comparison and alignment of complex, noisy datasets in biology and beyond.

Abstract

The rapid growth of high-dimensional datasets across various scientific domains has created a pressing need for new statistical methods to compare distributions supported on their underlying structures. Assessing similarity between datasets whose samples lie on low-dimensional manifolds requires robust techniques capable of separating meaningful signal from noise. We propose a principled framework for statistical inference of similarity and alignment between distributions supported on manifolds underlying high-dimensional datasets in the presence of heterogeneous noise. The key idea is to link the low-rank structure of observed data matrices to their underlying manifold geometry. By analyzing the spectrum of the sample covariance under a manifold signal-plus-noise model, we develop a scale-invariant distance measure between datasets based on their principal variance structures. We further introduce a consistent estimator for this distance and a statistical test for manifold alignability, and establish their asymptotic properties using random matrix theory. The proposed framework accommodates heterogeneous noise across datasets and offers an efficient, theoretically grounded approach for comparing high-dimensional datasets with low-dimensional manifold structures. Through extensive simulations and analyses of multi-sample single-cell datasets, we demonstrate that our method achieves superior robustness and statistical power compared with existing approaches.
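
To make the idea of a scale-invariant distance built from principal variance structures concrete, the following is a minimal illustrative sketch. It is not the paper's nMSD or its estimator: the exact definition is not given in this summary, and the sketch omits the denoising stage and the random-matrix corrections for heterogeneous noise. The function names (`normalized_top_variances`, `toy_spectral_distance`, `make_data`), the choice of Euclidean distance, and the toy data generator are all assumptions made for illustration.

```python
import numpy as np


def normalized_top_variances(X, r):
    """Top-r sample principal variances of X (n samples x p features), scaled to sum to 1."""
    Xc = X - X.mean(axis=0, keepdims=True)      # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, in descending order
    variances = s[:r] ** 2 / (X.shape[0] - 1)   # top-r eigenvalues of the sample covariance
    return variances / variances.sum()          # normalization removes the global scale


def toy_spectral_distance(X, Y, r):
    """Euclidean distance between normalized top-r variance profiles (illustrative only)."""
    return float(np.linalg.norm(normalized_top_variances(X, r)
                                - normalized_top_variances(Y, r)))


def make_data(scales, n=500, p=20, noise=0.1, seed=0):
    """Hypothetical generator: low-dimensional signal embedded in R^p plus small isotropic noise."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, len(scales))) * np.asarray(scales)  # latent coordinates
    Q, _ = np.linalg.qr(rng.normal(size=(p, len(scales))))      # p x k orthonormal embedding
    return Z @ Q.T + noise * rng.normal(size=(n, p))


X = make_data([3.0, 2.0, 1.0], seed=0)
Y = 5.0 * X                                  # same structure, different global scale
W = make_data([5.0, 1.0, 1.0], seed=1)       # different variance profile

print(toy_spectral_distance(X, Y, r=3))      # essentially zero: rescaling is ignored
print(toy_spectral_distance(X, W, r=3))      # clearly positive: the profiles differ
```

Because the profile depends only on ratios of leading variances, rescaling or rotating a dataset leaves the toy distance unchanged. With noise levels that differ across datasets, the naive sample eigenvalues used here would be biased, which is the situation the paper's two-stage estimator is designed to handle.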

Paper Structure

This paper contains 22 sections, 16 theorems, 61 equations, 3 figures, 2 tables, 2 algorithms.

Key Result

Proposition 2.5

Under Assumption asmp:signal, let $S\in\mathbb{R}^p$ be centered with $\mathbb{E}[S]=0$ and let $M:=\mathbb{E}[SS^\top]$. With the coordinate functions $f_1,\ldots,f_n$ already defined by $(R\circ\iota)(x)=(f_1(x),\ldots,f_n(x),0,\ldots,0)$, we may, without loss of generality, apply an additional or

Figures (3)

  • Figure 1: Simulation summary across methods. (a–b) Equal signal with different noise: mean $p$ and rejection rate. (c–d) Equal noise with different signals: mean $p$ and rejection rate.
  • Figure 2: Pairwise $\widehat{\Pi}_r$-based distances. The first panel compares donor-pair distances between and within conditions; the second panel decomposes within-condition dispersion.
  • Figure 3: Low-dimensional embeddings from $\widehat{\Pi}_r$–based distances. Donors separate by condition in both views, indicating that top-$r$ principal-variance patterns are condition-informative.

Theorems & Definitions (20)

  • Definition 2.1: Population and empirical signal covariance matrices
  • Definition 2.2: Normalized population principal variances
  • Definition 2.3: Manifold spectral distance
  • Remark 2.4: Choosing the working rank $r$
  • Proposition 2.5: PCA alignment within the signal subspace
  • Proposition 2.6: Low-rankness of the sample signal matrix
  • Proposition 2.7: Matrix spectral concentration, stability, and delocalization
  • Theorem 3.1: Consistency of noise variance estimators
  • Theorem 3.2: Consistency of nMSD estimator
  • Proposition 3.3
  • ...and 10 more