
$\Gamma$-VAE: Curvature regularized variational autoencoders for uncovering emergent low dimensional geometric structure in high dimensional data

Jason Z. Kim, Nicolas Perrin-Gilbert, Erkan Narmanli, Paul Klein, Christopher R. Myers, Itai Cohen, Joshua J. Waterfall, James P. Sethna

TL;DR

The resulting regularized manifolds identify mesoscale structure associated with different cancer cell types, and accurately re-embed tissues from completely unseen, out-of-distribution cancers as if they were originally trained on them.

Abstract

Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces. For example, despite the tens of thousands of genes in the human genome, the principled study of genomics is fruitful because biological processes rely on coordinated organization that results in lower-dimensional phenotypes. To uncover this organization, many nonlinear dimensionality reduction techniques have successfully embedded high-dimensional data into low-dimensional spaces by preserving local similarities between data points. However, the nonlinearities in these methods allow for too much curvature to preserve general trends across multiple non-neighboring data clusters, thereby limiting their interpretability and generalizability to out-of-distribution data. Here, we address both of these limitations by regularizing the curvature of manifolds generated by variational autoencoders, a process we coin "$\Gamma$-VAE". We demonstrate its utility using two example data sets: bulk RNA-seq from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project; and single-cell RNA-seq from a lineage tracing experiment in hematopoietic stem cell differentiation. We find that the resulting regularized manifolds identify mesoscale structure associated with different cancer cell types, and accurately re-embed tissues from completely unseen, out-of-distribution cancers as if they were originally trained on them. Finally, we show that preserving long-range relationships to differentiated cells separates undifferentiated cells, which have not yet specialized, according to their eventual fate. Broadly, we anticipate that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models in any high-dimensional system with emergent low-dimensional behavior.
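
To make the idea of penalizing decoder curvature concrete, the following is a minimal sketch and not the regularizer used in the paper, whose exact form is not given on this page. It assumes a crude proxy: the mean squared second directional difference of the decoder along random unit directions in latent space, which mixes the tangential (parameter-effects) and normal (extrinsic) components of curvature. The names `second_difference_penalty`, `decoder`, `eps`, and `n_dirs`, and the toy decoders, are all hypothetical.

```python
import numpy as np

def second_difference_penalty(decoder, z, eps=1e-2, n_dirs=8, seed=0):
    """Mean squared second directional difference of `decoder` at latent points `z`.

    Assumption: a finite-difference proxy for curvature, not the paper's loss.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    total = 0.0
    for _ in range(n_dirs):
        # Random unit direction per latent point.
        v = rng.normal(size=z.shape)
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        # Central second difference approximates the second directional derivative.
        d2 = (decoder(z + eps * v) - 2.0 * decoder(z) + decoder(z - eps * v)) / eps**2
        total += np.mean(np.sum(d2**2, axis=1))
    return total / n_dirs

# Toy decoders from a 2-D latent space to a 5-D "gene" space (illustrative only).
A = np.random.default_rng(1).normal(size=(2, 5))
linear = lambda z: z @ A                 # flat manifold: zero second derivative
curved = lambda z: np.sin(3.0 * z) @ A   # visibly curved manifold

z = np.random.default_rng(2).normal(size=(64, 2))
flat_penalty = second_difference_penalty(linear, z)
curved_penalty = second_difference_penalty(curved, z)
```

In this sketch the linear decoder yields a penalty that vanishes up to floating-point error, while the sinusoidal decoder is penalized. In a training setting, a term of this kind would be added to the usual VAE objective with a weight chosen by the user.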

Paper Structure (8 sections, 5 equations, 4 figures)

Figures (4)

  • Figure 1: Explicit regularization of the model manifold curvature. (a) Schematic of tissues (colored points) connected to nearest neighbors (lines) in the high-dimensional space of gene expression. (b) 2-D UMAP embedding of the joint TCGA + GTEx datasets, with a sampled grid in the center. (c) PCA of the sampled grid projected back into gene space. (d) Schematic of a continuous and differentiable manifold through the tissue samples. (e) Embedding of the dataset using $\beta$-VAE, where the embedding is colored by the maximum parameter-effects curvature (copper, left half) and the maximum extrinsic curvature (grayscale, right half), with visibly extreme distortions. (f) PCA of a decoded grid shows a sharply deformed manifold. (g) Schematic of a manifold through tissue samples with less curvature. (h) Embedding of the dataset using $\Gamma$-VAE, demonstrating significantly less parameter-effects and extrinsic curvature, which can be further seen in (i) the PCA of a decoded portion of the latent space. (j) Schematic of a PCA through the datapoints, with (k) a corresponding PCA of the linear embedding and (l) a subset of points.
  • Figure 2: A geometric 3-dimensional atlas of human tissue and cancer gene expression. (a) Embedding of the joint TCGA + GTEx dataset into a 3-dimensional latent space of a highly regularized VAE, with a green plane spanning liver and muscle, and a red plane spanning the blood to the brain. (b) Projection of the decoded red and green planes onto a linear PCA of the data, showing the significant yet regularized curvature of the VAE manifold. (c) Angles between the tangent space (in gene space) at the origin versus the tangent space radially away from the origin. (d) Plot of the VAE embedding colored by the gene signature of adaptive immune response, which defines the axis from blood to brain; (e) the p53 pathway, which defines an axis from bone marrow to the majority of cancer tissues; and (f) epithelial-mesenchymal transition, which shows complex spatial gradients in the cancer clusters. (g) Plot of a subset of nine healthy tissues with arrows drawn to their corresponding adenocarcinomas, forming a distinct and uniform cancer axis. The embedding also shows a distinct orthogonal separation of squamous cell carcinomas. (h) Decoded gene trajectories from healthy colon to colon adenocarcinoma for select genes, showing curved, nonlinear pathways through gene space.
  • Figure 3: Out-of-distribution generalizability of regularized embeddings to unseen cancers. (a) A $\Gamma$-VAE embedding of the joint TCGA + GTEx dataset, with a subset of tissues (breast carcinoma, BRCA) colored in a copper gradient based on their distance to healthy breast tissue. (b) A zoomed-in plot of the joint breast and BRCA tissues in the VAE space, where the triple-negative breast carcinomas (TNBC) are plotted as tetrahedra. (c) A regularized embedding trained while leaving out all BRCA and BRCA juxta-tumor samples, with the BRCA tissues re-embedded after training. (d) A zoomed-in plot of the joint breast and re-embedded BRCA tissues that successfully separates TNBC from non-TNBC tissues, where the BRCA tissues are colored the same as in panel (b). (e) Pairwise-distance correlations for UMAP, unregularized VAE, and regularized VAE of each cancer tissue and all other tissues between embeddings where the cancer tissue was included in or excluded from the training. (f) Column-normalized density plots of all pairwise distances for each cancer tissue and all other tissues between the original embedding including the tissue, and the out-of-sample embedding that excluded the cancer tissue from training.
  • Figure 4: Cell fate prediction using curved embeddings. (a) $\Gamma$-VAE embedding of a lineage tracing experiment on hematopoietic stem cell differentiation. (b) Subset of multipotent LSK cells from a PCA of the data, yielding a 49% classification accuracy of the eventual fate of their daughter cells. (c) The same subset of cells from the $\Gamma$-VAE of the data, yielding a 60% classification accuracy of the eventual fate of their daughter cells. (d) Re-embedding of all data on the $\Gamma$-VAE manifold trained only on day 2 cells. (e) Column-normalized density plot of pairwise distances between day 4 and day 6 cells in the embedding, with the pairwise distances of a $\Gamma$-VAE trained on all data on the x-axis, and the pairwise distances of a $\Gamma$-VAE trained on only day 2 data on the y-axis. (f) Re-embedding of day 4 and day 6 cells on the PCA projections defined on only the day 2 cells.
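
The pairwise-distance comparison described in Figure 3(e) can be illustrated with a short sketch. It assumes that the comparison is a Pearson correlation between the distances from held-out samples to all other samples, computed once in the embedding trained with those samples and once in the embedding trained without them; the page does not state which correlation measure was used. The function names and the toy data are hypothetical.

```python
import numpy as np

def cross_distances(query, reference):
    """Euclidean distances from every query point to every reference point, flattened."""
    diff = query[:, None, :] - reference[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1)).ravel()

def pairwise_distance_correlation(query_a, ref_a, query_b, ref_b):
    """Pearson correlation of query-to-reference distances across two embeddings.

    Assumption: rows of `query_a`/`query_b` (and of `ref_a`/`ref_b`) describe
    the same samples in embedding A and embedding B, in the same order.
    """
    da = cross_distances(np.asarray(query_a, float), np.asarray(ref_a, float))
    db = cross_distances(np.asarray(query_b, float), np.asarray(ref_b, float))
    return float(np.corrcoef(da, db)[0, 1])

# Toy data: "reference" tissues and a held-out "query" tissue in a 2-D embedding.
rng = np.random.default_rng(0)
ref = rng.normal(size=(50, 2))
query = rng.normal(size=(10, 2)) + 3.0

# Embedding B is a rotated copy of embedding A, so all distances are preserved.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
r_rotated = pairwise_distance_correlation(query, ref, query @ R, ref @ R)

# Embedding C is unrelated noise, so distances should agree far less.
r_random = pairwise_distance_correlation(
    query, ref, rng.normal(size=(10, 2)), rng.normal(size=(50, 2))
)
```

Because the score depends only on distances, it is unchanged by rotations and translations of either embedding, which makes it a reasonable way to compare two independently trained latent spaces.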