Subspace Tensor Orthogonal Rotation Model (STORM) for Batch Alignment, Cell Type Deconvolution, and Gene Imputation in Spatial Transcriptomic Data

Sean Cottrell, Guo-Wei Wei, Longxiu Huang

Abstract

Spatial transcriptomics data analysis integrates cellular transcriptional activity with spatial coordinates to identify spatial domains, infer cell-type dynamics, and characterize gene expression patterns within tissues. Despite recent advances, significant challenges remain, including the treatment of batch effects, the handling of mixed cell-type signals, and the imputation of poorly measured or missing gene expression. This work addresses these challenges by introducing a novel Subspace Tensor Orthogonal Rotation Model (STORM) that aligns multiple slices which vary in their spatial dimensions and geometry by considering them at the level of physical patterns or microenvironments. To this end, STORM presents an irregular tensor factorization technique for decomposing a collection of gene expression matrices and integrating them into a shared latent space for downstream analysis. In contrast to black-box deep learning approaches, the proposed model is inherently interpretable. Numerical experiments demonstrate state-of-the-art performance in vertical and horizontal batch integration, cell-type deconvolution, and unmeasured gene imputation for spatial transcriptomics data.

Paper Structure (14 sections, 2 theorems, 23 equations, 5 figures)

Key Result

Theorem 4.1

Assume the STORM parameterization satisfies the following conditions: … Define the slice-specific embedded factors $U_k$. Then, for every slice $k$, … i.e., the Gram matrix of the embedded factors $U_k$ is identical across slices. Consequently, all slices share the same latent covariance geometry in the aligned $R$-dimensional subspace; slice-specific variability in component magnitudes is captured by the …
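
The Gram-matrix statement of Theorem 4.1 can be illustrated with a short worked derivation. This is a sketch under two assumptions borrowed from the Figure 1 caption rather than from the theorem's own condition list: each $Q_k$ has orthonormal columns, and the embedded factors take the PARAFAC2-style form $U_k = Q_k H$.

```latex
\begin{aligned}
U_k &= Q_k H, \qquad Q_k^\top Q_k = I_R \quad \text{(assumed)} \\
U_k^\top U_k &= H^\top Q_k^\top Q_k H = H^\top H \quad \text{for every slice } k, \\
Z_k &= U_k D_k \;\Longrightarrow\; Z_k^\top Z_k = D_k H^\top H D_k .
\end{aligned}
```

Under these assumptions the Gram matrix $H^\top H$ does not depend on $k$, and the only slice-specific term in $Z_k^\top Z_k$ is the diagonal scaling $D_k$.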

Figures (5)

  • Figure 1: Overview of the STORM framework. a. Given multi-slice (multi-batch) spatial transcriptomics data, we represent each slice $k=1,\dots,n_b$ as a spot-by-gene expression matrix $X_k \in \mathbb{R}^{n_{s_k} \times n_g}$ with $n_{s_k}$ spatial spots and $n_g$ genes. STORM learns, for each slice, an orthonormal spatial alignment matrix $Q_k\in \mathbb{R}^{n_{s_k} \times R}$ that acts only on the spot (spatial) mode and rotates slice-specific spatial subspaces into an $R$-dimensional shared spatial space. In this space, we obtain a regular subspace tensor $\mathcal{W} \in \mathbb{R}^{R \times n_g \times n_b}$ of ST slices, which we can factorize. b. In this aligned space, STORM estimates shared factors $H\in \mathbb{R}^{R \times R}$ and $B\in \mathbb{R}^{n_g \times R}$ that capture common latent structure across slices, together with a tensor of diagonal scaling matrices $\mathcal{D}\in \mathbb{R}^{R \times R \times n_b}$ whose slices $D_k = \mathcal{D}_{::k} = \text{diag}(w_k)\in \mathbb{R}^{R \times R}$ ($w_k \in \mathbb{R}^{R}$) modulate the contribution of the shared components within each slice. The reconstructed expression profiles are further refined using spatial proximity, protein-protein interaction (PPI) gene relations, and cross-sample spot adjacencies. c. The resulting shared embedding $Z_k = Q_kHD_k$ supports downstream analyses including data integration, cell-type deconvolution, and gene imputation.
  • Figure 2: Illustration of STORM clustering. a. Ground-truth annotations of DLPFC samples 151673, 151674, 151675, and 151676 into an innermost white matter region and six neuronal layers extending linearly outward. We additionally compare the integrated spatial domains detected by various methods on these samples against the STORM-induced domains and the ground truth. The STORM domains correspond more closely to the ground truth than those of the other methods. b. Integrated UMAP embeddings of each method, colored by sample and by ground-truth spatial domain. Coloring by batch provides a qualitative measure of how well each method integrates the samples. Coloring by ground-truth spatial domain provides a measure of how well each method preserves genuine biological variation across the samples. c. UMAP embeddings induced by joint PCA on the concatenated gene expression matrices show poor integration of samples as well as poor separation of spatial domains, illustrating the presence of batch effects and motivating additional computational techniques. d. Quantitative comparison of the performance of various methods for clustering and integration of the DLPFC data according to the ARI and F1LISI metrics. STORM achieves the highest performance on both metrics. e. Horizontal integration of the anterior and posterior sections of a mouse brain. We display the histology image alongside the ground-truth Allen Mouse Brain annotations and compare the spatial domains detected by various methods against those of STORM.
  • Figure 3: Illustration of STORM for cell type deconvolution. a. Synthetic low-resolution seqFISH+ and MERFISH data are created by binning the cells in single-cell-resolution ST data into multi-cell spots for convenient benchmarking. Both data types were binned to create synthetic spot sizes of 50 micrometers, matching the resolution of technologies such as Visium. b. Quantitative comparison of the cell type deconvolution performance of STORM against other recent methods, taken from benchmarking results on the seqFISH+ and MERFISH synthetic data, according to the JSD and RMSE metrics. c. Visualization of the spatial distribution of STORM-induced spatial subspace factors $Q_kHD_k$ with Leiden spatial domains and cell type spatial mappings. The subspaces, clusters, and cell type proportions converge on coherent biological niches, such as germinal centers and plasma-endothelial microenvironments. Gene factor loadings in $B$ onto these subspaces support a coherent biological narrative. This demonstrates the interpretability of spatial factors as spatial niches. d. Spatial mapping by STORM of various B cells, FDCs, and T cells to germinal center zones of a human lymph node sample, as well as a quantitative comparison of STORM's performance against other methods in the task of mapping germinal-center-specific cells.
  • Figure 4: a. Overview of the STORM workflow for gene imputation. STORM is used to construct a shared latent space between an ST sample and a sc-RNA-seq reference in $Q_kHD_k$. This shared latent space is used to construct a 3-way sc-RNA-seq reference tensor of spatial spots and their neighboring single-cell pseudo-spots, which can be input to a GraphSAGE model to predict imputed gene expression values for each ST cell from the single-cell reference. A gene graph is constructed from the completed gene factor loadings in $B$ to further smooth the GraphSAGE model along this axis. b. Quantitative comparison of STORM against several other imputation methods on 5 benchmark datasets. Methods were evaluated on RMSE and CSS at both the cell and gene levels. STORM consistently ranks as, or among, the top performers.
  • Figure 5: STORM tracking of mouse embryo evolution. a. The STORM-predicted spatial domains of the mouse embryo at developmental stages E10.5, E11.5, and E12.5. Clusters are annotated via the presence of known marker genes and show clear correspondence with the known anatomical structure of the developing mouse embryo. b. UMAP embeddings produced by STORM, colored by spatial domain and by developmental stage label. Together, these plots demonstrate that STORM is able to integrate the samples without blurring the temporal differences between them. c. UMAP plots for several select spatial domains show that STORM embeddings depict clear developmental trends between time steps. Clusters in each developmental stage progress sequentially in the embedding space. Their co-localization implies the amelioration of batch effects, while their sequential ordering indicates that genuine biological variation has not been blurred. d. Plots of select spatial domains as well as differentially expressed marker genes within these domains. This demonstrates that the STORM domains show high concordance with known marker genes in the developing mouse embryo.
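
The benchmarking setup described in Figure 3 (binning single-cell-resolution data into 50-micrometer spots, then scoring estimated cell-type proportions with JSD and RMSE) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the square-grid binning, the base-2 Jensen-Shannon divergence, the synthetic coordinates, and the function names `bin_cells_into_spots`, `jsd`, and `rmse` are all assumptions made for the example.

```python
import numpy as np

def bin_cells_into_spots(coords, cell_types, n_types, spot_size=50.0):
    """Group cells on a square grid (assumed layout) and return
    the per-spot cell-type proportions, one row per occupied spot."""
    grid = np.floor(coords / spot_size).astype(int)
    _, spot_idx = np.unique(grid, axis=0, return_inverse=True)
    spot_idx = spot_idx.ravel()  # inverse may be 2-D on some NumPy versions
    counts = np.zeros((spot_idx.max() + 1, n_types))
    np.add.at(counts, (spot_idx, cell_types), 1.0)  # accumulate cell counts
    return counts / counts.sum(axis=1, keepdims=True)

def jsd(p, q, eps=1e-12):
    """Row-wise Jensen-Shannon divergence with base-2 logs (range [0, 1])."""
    p = (p + eps) / (p + eps).sum(axis=1, keepdims=True)
    q = (q + eps) / (q + eps).sum(axis=1, keepdims=True)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2(p / m), axis=1)
    kl_qm = np.sum(q * np.log2(q / m), axis=1)
    return 0.5 * kl_pm + 0.5 * kl_qm

def rmse(p, q):
    """Root-mean-square error over all spots and cell types."""
    return float(np.sqrt(np.mean((p - q) ** 2)))

# Synthetic stand-in for single-cell-resolution data (hypothetical values).
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 500.0, size=(2000, 2))   # positions in micrometers
cell_types = rng.integers(0, 5, size=2000)          # 5 hypothetical cell types
truth = bin_cells_into_spots(coords, cell_types, n_types=5)

# A perturbed copy of the truth plays the role of a deconvolution estimate.
estimate = np.clip(truth + rng.normal(0.0, 0.05, size=truth.shape), 0.0, None)
estimate = estimate / estimate.sum(axis=1, keepdims=True)

print("mean JSD:", jsd(truth, estimate).mean())
print("RMSE:", rmse(truth, estimate))
```

Because the ground-truth proportions come directly from the binned cells, any deconvolution method can be scored against them without a separate annotation step, which is what makes this kind of synthetic benchmark convenient.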

Theorems & Definitions (4)

  • Definition 4.1: PARAFAC2 Model
  • Theorem 4.1
  • Proof
  • Corollary 4.1
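
Definition 4.1 refers to the PARAFAC2 model, in which slices with different numbers of rows share a common factor structure. A minimal numerical sketch of a PARAFAC2-style parameterization, using the shapes given in the Figure 1 caption, is shown below. It only generates synthetic factors and does not fit anything; the reconstruction $X_k \approx Q_k H D_k B^\top$, the random initialization, and the helper names `random_orthonormal` and `build_slices` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def random_orthonormal(n_rows, n_cols, rng):
    """Return an (n_rows x n_cols) matrix with orthonormal columns via reduced QR."""
    q, _ = np.linalg.qr(rng.standard_normal((n_rows, n_cols)))
    return q

def build_slices(spot_counts, n_genes, rank, seed=0):
    """Generate synthetic slices that follow an assumed PARAFAC2-style structure."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((rank, rank))      # shared R x R factor
    B = rng.standard_normal((n_genes, rank))   # shared gene loadings, n_g x R
    slices = []
    for n_spots in spot_counts:                # each slice has its own spot count
        Q = random_orthonormal(n_spots, rank, rng)   # slice-specific, Q^T Q = I
        D = np.diag(rng.uniform(0.5, 2.0, size=rank))  # diag(w_k), w_k in R^R
        Z = Q @ H @ D                          # shared embedding Z_k = Q_k H D_k
        X = Z @ B.T                            # assumed reconstruction of X_k
        slices.append({"Q": Q, "D": D, "Z": Z, "X": X})
    return H, B, slices

H, B, slices = build_slices(spot_counts=[40, 55, 30], n_genes=20, rank=4)
for s in slices:
    U = s["Q"] @ H
    same_gram = np.allclose(U.T @ U, H.T @ H)
    print(s["X"].shape, s["Z"].shape, same_gram)
```

Although the three slices have 40, 55, and 30 spots, every embedding $Z_k$ has the same number of columns $R$, which is what allows irregular slices to be compared in one latent space.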