Resampling and averaging coordinates on data

Andrew J. Blumberg, Mathieu Carriere, Jun Hou Fung, Michael A. Mandell

TL;DR

The paper tackles the instability of dimensionality reduction on high-dimensional data by building a robust embedding pipeline with four steps: (a) generate many candidate low-dimensional coordinates by subsampling the data and varying hyperparameters, (b) align the candidates with affine Procrustes analysis, (c) select a representative cluster of embeddings using density metrics and persistent homology from topological data analysis, and (d) return a final embedding obtained by averaging the aligned representatives, treating points that are not included as outliers. The approach treats Isomap and other dimensionality reduction methods as black boxes, and uses the $PH_{0}$ and $PH_{1}$ invariants to favor contractible charts, which makes the result robust to noise and outliers. A theoretical analysis of the Procrustes problem and of alternating least squares (ALS) gives insight into convergence and stability. Experiments on synthetic and real data, including single-cell genomics, demonstrate improved robustness and practical applicability for obtaining reliable intrinsic coordinates. The framework is presented as a scalable, parameter-insensitive route to robust dimensionality reduction, with the PH invariants and the Procrustes clustering serving as diagnostic signals for downstream analyses in high-dimensional settings.

Abstract

We introduce algorithms for robustly computing intrinsic coordinates on point clouds. Our approach relies on generating many candidate coordinates by subsampling the data and varying hyperparameters of the embedding algorithm (e.g., manifold learning). We then identify a subset of representative embeddings by clustering the collection of candidate coordinates and using shape descriptors from topological data analysis. The final output is the embedding obtained as an average of the representative embeddings using generalized Procrustes analysis. We validate our algorithm on both synthetic data and experimental measurements from genomics, demonstrating robustness to noise and outliers.
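
To make the final averaging step concrete, the following is a minimal sketch of generalized Procrustes averaging, not the authors' implementation. It makes several simplifying assumptions that are not in the paper: the embeddings are 2-dimensional, every candidate contains the same points in the same order (the paper's subsamples differ), and alignment uses only translations, rotations, and reflections rather than the affine Procrustes analysis mentioned above. The function names `center`, `align`, `generalized_procrustes`, and `transform` are hypothetical and chosen for illustration.

```python
import math

def center(points):
    """Translate a list of 2D points so that its centroid is the origin."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]

def align(source, target):
    """Rotate (and reflect if that fits better) centered `source` onto centered `target`.

    In 2D the least-squares rotation has a closed form: maximize
    a*cos(t) + b*sin(t), which gives t = atan2(b, a).
    """
    pairs = list(zip(source, target))
    a = sum(x1 * y1 + x2 * y2 for (x1, x2), (y1, y2) in pairs)
    b = sum(x1 * y2 - x2 * y1 for (x1, x2), (y1, y2) in pairs)
    # Same sums after reflecting the source across the first axis.
    ar = sum(x1 * y1 - x2 * y2 for (x1, x2), (y1, y2) in pairs)
    br = sum(x1 * y2 + x2 * y1 for (x1, x2), (y1, y2) in pairs)
    if math.hypot(ar, br) > math.hypot(a, b):
        source = [(x1, -x2) for x1, x2 in source]
        a, b = ar, br
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x1 - s * x2, s * x1 + c * x2) for x1, x2 in source]

def generalized_procrustes(embeddings, n_iter=20):
    """Alternate between aligning every embedding to the mean and updating the mean."""
    aligned = [center(e) for e in embeddings]
    mean = aligned[0]  # arbitrary initial reference
    for _ in range(n_iter):
        aligned = [align(e, mean) for e in aligned]
        k = len(aligned)
        mean = [
            (sum(e[i][0] for e in aligned) / k, sum(e[i][1] for e in aligned) / k)
            for i in range(len(mean))
        ]
    return mean, aligned

def transform(points, angle, shift, reflect=False):
    """Helper for the demo: apply a rigid motion (optionally with a reflection)."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y in points:
        if reflect:
            y = -y
        out.append((c * x - s * y + shift[0], s * x + c * y + shift[1]))
    return out

# Three "candidate embeddings" of the same five points, differing by rigid motions.
base = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0), (-1.0, 1.0)]
candidates = [
    base,
    transform(base, 0.7, (3.0, -1.0)),
    transform(base, -2.1, (0.5, 4.0), reflect=True),
]
mean, aligned = generalized_procrustes(candidates)
```

Because the three candidates in this demo differ only by rigid motions, the aligned copies coincide with the mean up to floating-point error. With noisy candidates the loop would instead settle on a consensus shape, which is the role the averaged embedding plays in the pipeline described above.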

Paper Structure

This paper contains 16 sections, 2 theorems, 34 equations, and 20 figures.

Key Result

Proposition 4.4

If $G$ is compact or the semi-direct product of a compact group and the translation group, then $\mathcal{E}$ achieves a global minimum.

Figures (20)

  • Figure 1: Swiss roll dataset
  • Figure 2: Some example outputs of Isomap applied to a noisy Swiss roll dataset.
  • Figure 3: An MDS representation of the Procrustes distances between Isomap outputs of 200 noisy Swiss rolls: each point represents an output, and the plot is shaded according to density of the points. Some representative points are indicated together with their corresponding Isomap embedding.
  • Figure 4: The Procrustes-aligned Isomap embedding corresponding to the cluster of unrolled Swiss rolls.
  • Figure 5: Additive Gaussian noise in the ambient space.
  • ...and 15 more figures

Theorems & Definitions (5)

  • Definition 4.3
  • Proposition 4.4
  • Proposition 4.5
  • proof
  • Definition 4.6