Principal Decomposition with Nested Submanifolds

Jiaji Su, Zhigang Yao

Abstract

Over the past decades, the growing dimensionality of data has heightened the need for effective data decomposition methods. Existing approaches, however, often rely on linear models or lack sufficient interpretability or flexibility. To address this issue, we introduce a novel nonlinear decomposition technique called principal nested submanifolds, which builds on the foundational concepts of principal component analysis. This method exploits the local geometric information of data sets by projecting samples onto a series of nested principal submanifolds with progressively decreasing dimensions. It isolates complex information within the data in a backward stepwise manner by targeting variations associated with smaller eigenvalues in local covariance matrices. Unlike previous methods, the resulting subspaces are smooth manifolds, not merely linear spaces or special shape spaces. Validated through extensive simulation studies and applied to real-world RNA sequencing data, our approach surpasses existing models in delineating intricate nonlinear structures. It provides more flexible subspace constraints that improve the extraction of significant data components and facilitate noise reduction. This approach not only advances the non-Euclidean statistical analysis of data with low-dimensional intrinsic structure within Euclidean spaces, but also offers new perspectives for dealing with high-dimensional noisy data sets in fields such as bioinformatics and machine learning.
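To make the phrase "targeting variations associated with smaller eigenvalues in local covariance matrices" more concrete, the following is a minimal sketch of a single local step, not the authors' algorithm. It assumes a Gaussian kernel for the local weights, and the function names `local_covariance` and `remove_minor_directions`, as well as the `bandwidth` and `n_remove` parameters, are hypothetical names introduced only for illustration. The sketch performs one local linear projection; the paper's method instead defines smooth nested submanifolds.

```python
import numpy as np


def local_covariance(data, center, bandwidth):
    """Kernel-weighted mean and covariance of `data` around `center`.

    Assumption: Gaussian weights with scale `bandwidth`; the paper's
    actual weighting scheme is not given in this overview.
    """
    sq_dist = np.sum((data - center) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    weights /= weights.sum()
    local_mean = weights @ data
    centered = data - local_mean
    cov = (centered * weights[:, None]).T @ centered
    return local_mean, cov


def remove_minor_directions(point, data, bandwidth, n_remove=1):
    """Remove the component of `point` (relative to the local mean)
    along the `n_remove` eigenvectors with the smallest eigenvalues
    of the local covariance matrix."""
    local_mean, cov = local_covariance(data, point, bandwidth)
    # np.linalg.eigh returns eigenvalues in ascending order, so the
    # first columns of `eigvecs` span the least-variance directions.
    _, eigvecs = np.linalg.eigh(cov)
    minor = eigvecs[:, :n_remove]
    offset = point - local_mean
    return point - minor @ (minor.T @ offset)


# Illustrative data (hypothetical): a nearly flat sheet in R^3 with
# small noise in the third coordinate.
rng = np.random.default_rng(0)
sheet = np.column_stack([
    rng.uniform(-1.0, 1.0, size=500),
    rng.uniform(-1.0, 1.0, size=500),
    0.01 * rng.standard_normal(500),
])
projected = remove_minor_directions(np.array([0.0, 0.0, 0.5]), sheet, bandwidth=1.0)
```

In this toy setting the least-variance direction is close to the third axis, so the step pulls the query point back toward the sheet. Repeating such a step with increasing `n_remove` mirrors, only loosely, the backward stepwise reduction in dimension described in the abstract.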

Paper Structure

This paper contains 23 sections, 15 theorems, 180 equations, 12 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Let $\mathcal{M}$ be an embedded submanifold of $\mathbb{R}^{D}$. Then,

Figures (12)

  • Figure 1: Illustration for the involved geometrical concepts.
  • Figure 2: Conceptual illustration of the principal nested submanifolds. Each $\mathcal{M}_{r,d}$ is determined as the root set of $\sum_{k=1}^{D-d} b_{r,k}(z)$ and is nested within $\mathcal{M}_{r,d+1}$.
  • Figure 3: Scatter plots for the line segment case in $\mathbb{R}^3$ with colors added to enhance the visualization of sample adjacency. (a) Input data set; (b,c) Projection results onto the principal nested submanifolds of dimensions 2 and 1, respectively; (d,e) Projections onto the first two and single principal components, respectively.
  • Figure 4: Scatter plots for the circle case in $\mathbb{R}^3$ with colors added to enhance the visualization of sample adjacency. (a) Input data set; (b,c) Projection results onto the principal nested submanifolds of dimensions 2 and 1, respectively; (d,e) Projections onto the first two and single principal components, respectively.
  • Figure 5: Scatter plots for the involute case in $\mathbb{R}^3$ with colors added to enhance the visualization of sample adjacency. (a) Input data set; (b,c) Projection results onto the principal nested submanifolds of dimensions 2 and 1, respectively; (d,e) Projections onto the first two and single principal components, respectively.
  • ...and 7 more figures

Theorems & Definitions (29)

  • Definition 1: Reach
  • Remark
  • Lemma 1: Federer's reach condition
  • Definition 2: Fréchet Mean
  • Proposition 1
  • Theorem 1
  • Corollary 1.1
  • Theorem 2: Consistency of empirical principal submanifolds
  • proof
  • Lemma A.1
  • ...and 19 more