scTree: Discovering Cellular Hierarchies in the Presence of Batch Effects in scRNA-seq Data

Moritz Vandenhirtz, Florian Barkmann, Laura Manduchi, Julia E. Vogt, Valentina Boeva

TL;DR

This work tackles the challenge of uncovering cellular hierarchies in scRNA-seq data when batch effects obscure the true structure. It introduces scTree, an extension of TreeVAE that jointly learns a binary-tree latent space and batch-corrected representations in an end-to-end framework, using leaf-specific decoders and a batch offset to model batch effects. A reconstruction-loss-based splitting rule enables detection of imbalanced, rare cell types and thus finer-grained hierarchies. Across seven datasets, scTree achieves competitive or superior clustering and hierarchy quality, particularly on datasets with strong batch effects, and discovers biologically plausible hierarchical structures. Code is provided for reproducibility and reuse.

Abstract

We propose a novel method, scTree, for single-cell Tree Variational Autoencoders, extending a hierarchical clustering approach to single-cell RNA sequencing data. scTree corrects for batch effects while simultaneously learning a tree-structured data representation. This VAE-based method allows for a more in-depth understanding of complex cellular landscapes independently of the biasing effects of batches. We show empirically on seven datasets that scTree discovers the underlying clusters of the data and the hierarchical relations between them, and that it outperforms established baseline methods across these datasets. Additionally, we analyze the learned hierarchy to understand its biological relevance, thus underpinning the importance of integrating batch correction directly into the clustering procedure.

Paper Structure (11 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Schematic overview of the proposed method. The input ${\bm{x}}$ is passed through an encoder and subsequently reconstructed through a tree-shaped process. The process consists of probabilistically going left or right at each node, followed by a nonlinear transformation of the embedding ${\bm{z}}_i$. The cluster-specific decoders take as input their leaf embedding and batch information, and reconstruct the gene count parameters of the negative binomial distribution.
  • Figure 2: Visualization of the hierarchy discovered by scTree on IHC. The size of each node represents the number of cells assigned to it. We exclude empty leaves from the tree. Left: Hierarchy of cell types. Lymphoids, Myeloids and HSPCs have distinct colors. The "*" indicates cell types exclusive to the bone marrow samples. Right: Hierarchy of batches. Bone marrow and PBMC batches have distinct colors.
  • Figure 3: Uniform Manifold Approximation and Projection (UMAP) plots for the Pancreas (a) and the IHC (b) datasets, based on the first 50 PCs computed on the log-transformed normalized gene expression, the latent representations of scVI and LDVAE, and the root node representation of scTree with both splitting rules. The plots are colored by cell type (top), batch (middle), and cluster (bottom).
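
The tree-shaped process in Figure 1, where a cell goes left or right with some probability at each node, implies a probability for every leaf: the product of the routing decisions along the path from the root. The following is a minimal sketch of that bookkeeping only, not the authors' implementation. The `Node` class, the `leaf_probabilities` helper, and the fixed routing values are hypothetical; in the model itself these probabilities would be produced per cell by learned routers.

```python
class Node:
    """Node of a binary tree. Internal nodes carry the probability of going left."""

    def __init__(self, name, p_left=None, left=None, right=None):
        self.name = name
        self.p_left = p_left  # hypothetical fixed value; learned per cell in practice
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None


def leaf_probabilities(node, reach_prob=1.0):
    """Return {leaf name: probability of reaching that leaf from `node`}."""
    if node.is_leaf():
        return {node.name: reach_prob}
    probs = {}
    # Going left multiplies the path probability by p_left, going right by 1 - p_left.
    probs.update(leaf_probabilities(node.left, reach_prob * node.p_left))
    probs.update(leaf_probabilities(node.right, reach_prob * (1.0 - node.p_left)))
    return probs


# Toy tree with three leaves and made-up routing probabilities for a single cell.
tree = Node(
    "root",
    p_left=0.8,
    left=Node("n1", p_left=0.25, left=Node("leaf_a"), right=Node("leaf_b")),
    right=Node("leaf_c"),
)

probs = leaf_probabilities(tree)
# A hard cluster assignment picks the most probable leaf.
hard_assignment = max(probs, key=probs.get)
print(probs, hard_assignment)
```

The leaf probabilities sum to one by construction, so they can be read as a soft cluster assignment; taking the most probable leaf gives a hard assignment of the kind counted per node in Figure 2.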