Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis

Jiayu Su, David A. Knowles, Raul Rabadan

TL;DR

This work introduces Supervised Independent Subspace Principal Component Analysis, a PCA extension designed for multi-subspace learning that incorporates supervision and simultaneously ensures subspace disentanglement.

Abstract

The success of machine learning models relies heavily on effectively representing high-dimensional data. However, ensuring data representations capture human-understandable concepts remains difficult, often requiring the incorporation of prior knowledge and decomposition of data into multiple subspaces. Traditional linear methods fall short in modeling more than one space, while more expressive deep learning approaches lack interpretability. Here, we introduce Supervised Independent Subspace Principal Component Analysis ($\texttt{sisPCA}$), a PCA extension designed for multi-subspace learning. Leveraging the Hilbert-Schmidt Independence Criterion (HSIC), $\texttt{sisPCA}$ incorporates supervision and simultaneously ensures subspace disentanglement. We demonstrate $\texttt{sisPCA}$'s connections with autoencoders and regularized linear regression and showcase its ability to identify and separate hidden data structures through extensive applications, including breast cancer diagnosis from image features, learning aging-associated DNA methylation changes, and single-cell analysis of malaria infection. Our results reveal distinct functional pathways associated with malaria colonization, underscoring the essentiality of explainable representation in high-dimensional data analysis.
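To make the role of HSIC concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes linear kernels and the standard biased empirical estimator $\mathrm{HSIC}(A, B) = \mathrm{tr}(K_A H K_B H)/(n-1)^2$. It shows (i) a generic HSIC-based supervised PCA step, in which each subspace is the top eigenspace of $X^\top H K_Y H X$, and (ii) the HSIC between two learned subspaces, the kind of quantity a disentanglement penalty could act on. The function names (`hsic_linear`, `supervised_pca`) and the toy data are hypothetical, and the joint optimization that sisPCA performs is deliberately left out.

```python
import numpy as np


def centering_matrix(n):
    """H = I - (1/n) 11^T, which removes the mean when applied to a kernel."""
    return np.eye(n) - np.ones((n, n)) / n


def hsic_linear(A, B):
    """Biased empirical HSIC with linear kernels: tr(K_A H K_B H) / (n-1)^2."""
    n = A.shape[0]
    H = centering_matrix(n)
    K_A = A @ A.T
    K_B = B @ B.T
    return np.trace(K_A @ H @ K_B @ H) / (n - 1) ** 2


def supervised_pca(X, Y, k):
    """Top-k orthonormal directions U maximizing HSIC(X U, Y), linear kernels.

    This is a generic supervised PCA step (an assumption of this sketch),
    solved by eigendecomposition of the symmetric matrix X^T H K_Y H X.
    """
    n = X.shape[0]
    H = centering_matrix(n)
    K_Y = Y @ Y.T
    M = X.T @ H @ K_Y @ H @ X
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]        # keep the k largest


# Toy data (hypothetical): feature 0 tracks y1, feature 1 tracks y2,
# and the remaining features are pure noise.
rng = np.random.default_rng(0)
n = 500
y1 = rng.normal(size=(n, 1))
y2 = rng.normal(size=(n, 1))
X = rng.normal(size=(n, 5))
X[:, 0] = y1[:, 0] + 0.1 * rng.normal(size=n)
X[:, 1] = y2[:, 0] + 0.1 * rng.normal(size=n)

# One supervised subspace per target variable.
U1 = supervised_pca(X, y1, k=1)
U2 = supervised_pca(X, y2, k=1)
Z1, Z2 = X @ U1, X @ U2

# Dependence between the two subspaces; a disentanglement penalty would
# push this quantity toward zero while keeping each subspace predictive.
overlap = hsic_linear(Z1, Z2)
print("loadings for y1 subspace:", np.round(U1[:, 0], 2))
print("loadings for y2 subspace:", np.round(U2[:, 0], 2))
print("HSIC(Z1, Z2):", overlap)
```

In this sketch each subspace is fitted independently, so any dependence between `Z1` and `Z2` is only measured, not reduced. Penalizing that overlap during fitting is the part the abstract attributes to sisPCA.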

Paper Structure

This paper contains 42 sections, 21 equations, 12 figures, 4 tables, 2 algorithms.

Figures (12)

  • Figure 1: Example scRNA-seq dataset from \cite{afriat2022spatiotemporally}. Each dot represents the gene expression vector $\vec{x} \in \mathbb{R}^{8,203}$ of a cell, visualized in 2D and colored by cell properties $\{Y_m\}$. Variability in the dataset $X$ arises from multiple sources: (left to right) temporal dynamics of infection, technical batch effects, and cell quality. Incorporating supervisory information $Y$, such as time points, allows for the extraction of patterns in distinct subspaces $\{Z_m\}$ that correspond to different sources of variability. Moreover, the linear mapping $\{U_m: X \rightarrow Z_m\}$ directly quantifies the relationship between gene expression and the property of interest, enabling discoveries such as the identification of genes underlying the persistent defense against infection. The disentanglement is particularly important to ensure minimal confounding effects. See Section \ref{sec:liver-infection} for details.
  • Figure 2: Overview of sisPCA and its relationship with other PCA models.
  • Figure 3: Example application of recovering a latent space with three subspaces (rows in panel a) embedded in a high-dimensional space. The first two subspaces (rows) of sPCA (panel b) and sisPCA (panel c) are supervised by the corresponding target variables.
  • Figure 4: Feature extraction on the breast cancer dataset. The two top PC1 contributors in PCA (panel a) are used as supervisions to construct the 'radius' and 'symmetry' subspaces (panel b and c).
  • Figure 5: UMAP visualizations of scRNA-seq data. Each column shows a different learned subspace: (a) PCA, (b) sisPCA-infection and sisPCA-time, and (c) hsVAE-infection and hsVAE-time. See Fig. \ref{fig:sup-liver-umap-full} for other models. Cells are colored by either infection status (top row) or post-infection time (bottom row). In an optimal pair of subspaces, each property (infection status or time) should be more distinguishable in its corresponding subspace while showing less separation in the other.
  • ...and 7 more figures

Theorems & Definitions (4)

  • Remark 3.1
  • Remark 3.2
  • Remark 3.3
  • Definition D.1