Scalable Amortized GPLVMs for Single Cell Transcriptomics Data

Sarah Zhao, Aditya Ravuri, Vidhi Lalchand, Neil D. Lawrence

TL;DR

This work addresses scalable, interpretable dimensionality reduction for single-cell RNA-seq (scRNA-seq) data by extending Gaussian Process Latent Variable Models (GPLVMs) with amortized stochastic variational inference. It introduces an amortized Bayesian GPLVM (BGPLVM) tailored to scRNA-seq through two domain-informed kernels (SE-ARD+Linear for batch correction and PerSE-ARD+Linear for cell-cycle structure) and a data-aware ApproxPoisson likelihood based on library-size normalization, which together support robust clustering and uncertainty quantification. The method performs comparably to the leading scVI approach on synthetic and COVID-19 datasets, and it lets prior biological knowledge be built into the model explicitly, yielding more interpretable latent structures. Overall, the framework combines probabilistic modeling with domain knowledge to deliver scalable, interpretable embeddings for large-scale single-cell data, and its kernel-based design leaves room for further customization.
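To make the kernel naming concrete, here is a minimal NumPy sketch of an additive SE-ARD+Linear kernel. It assumes, as an illustration rather than the paper's confirmed formulation, that the squared-exponential ARD term acts on the learned latent coordinates and the linear term acts on a one-hot batch indicator. The function names, parameter names, and toy data are all hypothetical.

```python
import numpy as np


def se_ard_kernel(x1, x2, lengthscales, variance=1.0):
    """Squared-exponential kernel with one lengthscale per latent dimension (ARD)."""
    a = x1 / lengthscales
    b = x2 / lengthscales
    # Pairwise squared Euclidean distances between the rescaled points.
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0))


def linear_kernel(b1, b2, variance=1.0):
    """Linear kernel on batch covariates (here: one-hot batch indicators)."""
    return variance * b1 @ b2.T


def se_ard_plus_linear(x1, b1, x2, b2, lengthscales, se_var=1.0, lin_var=1.0):
    """Additive kernel: SE-ARD on latents plus Linear on batch indicators.

    Assumption for illustration: the two terms are simply summed, so batch
    effects enter as a separate additive component of the GP.
    """
    return se_ard_kernel(x1, x2, lengthscales, se_var) + linear_kernel(b1, b2, lin_var)


# Toy data: 5 cells, 2 latent dimensions, 3 batches (all values hypothetical).
rng = np.random.default_rng(0)
latents = rng.normal(size=(5, 2))
batch_ids = np.array([0, 0, 1, 1, 2])
batch_one_hot = np.eye(3)[batch_ids]

K = se_ard_plus_linear(
    latents, batch_one_hot, latents, batch_one_hot,
    lengthscales=np.array([1.0, 2.0]),
)
```

With one-hot batch indicators, the linear term adds a constant to the covariance of any two cells in the same batch and nothing across batches, so in this sketch the batch effect is carried by its own kernel component instead of the latent coordinates.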

Abstract

Dimensionality reduction is crucial for analyzing large-scale single-cell RNA-seq data. Gaussian Process Latent Variable Models (GPLVMs) offer an interpretable dimensionality reduction method, but current scalable models lack effectiveness in clustering cell types. We introduce an improved model, the amortized stochastic variational Bayesian GPLVM (BGPLVM), tailored for single-cell RNA-seq with specialized encoder, kernel, and likelihood designs. This model matches the performance of the leading single-cell variational inference (scVI) approach on synthetic and real-world COVID datasets and effectively incorporates cell-cycle and batch information to reveal more interpretable latent structures as we demonstrate on an innate immunity dataset.

Paper Structure

This paper contains 29 sections, 26 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of Modified BGPLVM Model
  • Figure 2: Ablation study with the simulated dataset on the proposed BGPLVM model where we change one component at a time (labeled in subfigures) and visualize the resulting UMAPs. The top row is colored by cell-type and the bottom row by batch.
  • Figure 3: UMAPs generated from the latent spaces of four models on the COVID dataset: an implementation of the original BGPLVM, the modified BGPLVM for scRNA-seq data, scVI, and a linear-decoder scVI (LDVAE). The top row is colored by cell type and the bottom row by batch.
  • Figure 4: (Top row) Plots of log means and log variances (both parametrized by the same GP) versus the learned cell-cycle pseudotime dimension for three specific genes (UBE2C, CDC6, FN1). The squares depict log variances and the circles depict log means of the library-normalized data, both colored by the phases annotated in Kumasaka et al. (2021). Our model's learned cell-cycle phases correspond roughly to the phases labelled in Kumasaka et al. (2021). (Bottom row) UMAP plots of our model's learned latent space, excluding directions identified with hidden technical effects (e.g. batch and plate border effects). Cells are colored by treatment condition (left), primary pseudotime direction (middle), and secondary pseudotime direction (right).
  • Figure 5: Overview of the scVI architecture, adapted from Lopez et al. (2018).
  • ...and 2 more figures