Distribution-free screening of spatially variable genes in spatial transcriptomics

Changhu Wang; Qiyun Huang; Zihao Chen; Jin Liu; Ruibin Xi

TL;DR

The paper introduces MM-test, a distribution-free method for screening spatially variable genes (SVGs) in spatial transcriptomics. It is built on a novel quasi-likelihood ratio statistic and combined with a knockoff procedure to control the false discovery rate (FDR). In benchmarks on simulations and 34 real datasets, MM-test consistently outperforms existing SVG detection methods.

Abstract

Spatial transcriptomics (ST) technologies enable transcriptome-wide gene expression profiling while preserving spatial resolution, offering unprecedented opportunities to uncover complex spatial structures. Due to the ultra-high dimensionality of ST data, identifying spatially variable genes (SVGs) associated with unknown spatial clusters has become a central task in ST data analysis. Here, we develop a distribution-free SVG screening method based on a novel quasi-likelihood ratio statistic, the MM-test, combined with a knockoff procedure to control the false discovery rate (FDR). MM-test leverages auxiliary information about the unknown spatial domains, such as spatial distances, for SVG screening. Notably, in addition to two-dimensional ST datasets, MM-test is well-suited for increasingly common three-dimensional (3D), multi-slice ST datasets. Extensive benchmarking using simulations and 34 real ST datasets demonstrates that MM-test consistently outperforms existing SVG detection methods. In a 3D mouse brain dataset, MM-test accurately delineates fine-scale structures that are challenging for other methods, such as the 3D architecture of the pyramidal layer of the hippocampal cornu ammonis and the dentate gyrus. Theoretical guarantees, including selection consistency, FDR control, and an error bound for post-selection clustering, are also established.
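To make the FDR-control step concrete, here is a minimal Python sketch of the generic knockoff+ selection rule (Barber and Candès). It is not the paper's implementation. It assumes each gene j already has a signed statistic W_j that is large and positive when the gene looks more informative than its knockoff copy. How MM-test constructs these statistics is not described in this abstract, and the function names `knockoff_plus_threshold` and `knockoff_select` are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t among the |W_j| whose estimated false discovery proportion is <= q.

    The estimate (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) is the generic
    knockoff+ rule; it is an assumption that the paper uses this exact form.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # genes that would be selected
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")  # no threshold meets the target level: select nothing


def knockoff_select(w, q):
    """Indices of genes whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Toy statistics for ten hypothetical genes (illustrative values only).
w = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.4, -0.3, 2.0, 1.5]
selected = knockoff_select(w, q=0.2)
```

The "+1" in the numerator makes the rule conservative: with few large statistics it may select nothing at all, which is the intended behaviour when the target level cannot be met.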

Paper Structure (15 sections, 3 theorems, 10 equations, 5 figures, 3 tables)

Introduction
Data description and motivation
Model setup and the MM-test statistic
Model setup
MM-test statistic
Working dispersion $\hat{\phi}$ specification via spatial information
FDR control via knockoff
Theoretical properties of the MM-test
Simulation
Performance on feature screening
Feature screening improves clustering analysis
Benchmarking on real spatial transcriptomics data
MM-test accurately detects fine-grained 3D spatial domains
Discussion
Data availability

Key Result

Theorem 1

Assume that Conditions 1--2 hold. Let $0 < \varpi < 1/2-\vartheta$ be a constant. When $n$ is sufficiently large, for $r_n^{\vartheta}(\log n)^2 < T_n < r_n^{1/2-\varpi} \log n$ and $t_n = r_n^{-\kappa}$ with $-1 + 3\vartheta < \kappa < 5/2$, we have ${\mathbb{P}}\left( S_1 \neq \hat{S}_

Figures (5)

Figure 1: (A) Experimental design of the 3D spatial transcriptomics dataset from Ortiz et al. (2020). (B) Violin plots of quality control metrics across samples. The top panel shows nFeature_RNA (detected genes per cell), and the bottom shows nCount_RNA (total UMI counts). (C) 3D spatial visualization of Ier5 expression, which is enriched in the isocortex in this dataset. Color intensity indicates expression level, with red corresponding to higher expression. (D) Visualization of Ier5 expression in a two-dimensional slide.
Figure 2: Benchmarking results of SVG identification methods across 34 real ST datasets. The y-axis represents the mean values of AUPRC, AUROC, and early precision (EP) for each method, calculated separately for each silver standard (Wilcoxon test and negative binomial regression). The last panel shows the mean ARI values for spatial clustering based on the SVGs identified by each method. Higher values indicate better performance.
Figure 3: (A) UMAP visualization of cell clusters identified by the proposed MM-test method. (B) Brain region annotations based on the Allen Brain Atlas. (C) Clustering results on slice 23B from various methods, using their selected SVGs across all 20 slices. The red circle highlights the DG region, which was clearly identified only by MM-test. The arrow points to the DG region. (D) Marker gene enrichment analysis: heatmaps showing expression patterns of region-specific marker genes for clusters identified by MM-test and Moran. Marker genes are obtained from the Allen Brain Atlas (Lein et al., 2007). Region abbreviations are as defined in (B).
Figure 4: 3D visualization of clusters identified by the MM-test across 20 consecutive slices. (A) All clusters identified by the MM-test. (B) 3D anatomical annotation of the CA and DG. (C) CAsp and DG regions (Clusters 10 and 17) identified by all methods. (D) CAsp (Cluster 10) and DG regions (Clusters 16 and 17) identified by MM-test. (E) Isocortex region identified by all methods. The arrow points to the misclassified spots from the CTXsp region (i.e., Moran's Cluster 1).
Figure 5: Number of marker genes of different regions among top SVGs identified by different methods. Abbreviations are as defined in Figure 3(B).
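
The ranking metrics named in the Figure 2 caption can be illustrated with a short Python sketch. It assumes each gene has a method score (higher means more likely an SVG) and a binary silver-standard label. AUROC is computed in its standard pairwise form. Early precision is taken here as precision among the top k genes with k equal to the number of labelled positives, which is a common convention and an assumption about the paper's exact definition. The function names and toy values are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random positive gene outscores a random negative one (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def early_precision(scores, labels):
    """Precision among the top-k scored genes, k = number of positives (assumed definition)."""
    k = sum(labels)
    if k == 0:
        raise ValueError("need at least one positive label")
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / k


# Six hypothetical genes: scores from an SVG method, labels from a silver standard.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = auroc(scores, labels)
ep = early_precision(scores, labels)
```

The pairwise AUROC is quadratic in the number of genes, which is fine for illustration; a rank-based formula would be preferable at transcriptome scale.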

Theorems & Definitions (4)

Example 1: Spatial transcriptomics data
Theorem 1
Theorem 2
Theorem 3
