
Distribution-free screening of spatially variable genes in spatial transcriptomics

Changhu Wang, Qiyun Huang, Zihao Chen, Jin Liu, Ruibin Xi

TL;DR

The paper proposes a distribution-free screening method for spatially variable genes (SVGs) based on a novel quasi-likelihood ratio statistic, the MM-test, combined with a knockoff procedure to control the false discovery rate (FDR). In benchmarks on simulations and 34 real spatial transcriptomics datasets, MM-test consistently outperforms existing SVG detection methods.

Abstract

Spatial transcriptomics (ST) technologies enable transcriptome-wide gene expression profiling while preserving spatial resolution, offering unprecedented opportunities to uncover complex spatial structures. Due to the ultra-high dimensionality of ST data, identifying spatially variable genes (SVGs) associated with unknown spatial clusters has become a central task in ST data analysis. Here, we develop a distribution-free SVG screening method based on a novel quasi-likelihood ratio statistic, the MM-test, combined with a knockoff procedure to control the false discovery rate (FDR). MM-test leverages auxiliary information about the unknown spatial domains, such as spatial distances, for SVG screening. Notably, in addition to two-dimensional ST datasets, MM-test is well-suited for increasingly common three-dimensional (3D), multi-slice ST datasets. Extensive benchmarking using simulations and 34 real ST datasets demonstrates that MM-test consistently outperforms existing SVG detection methods. In a 3D mouse brain dataset, MM-test accurately delineates fine-scale structures that are challenging for other methods, such as the 3D architecture of the pyramidal layer of the hippocampal cornu ammonis and the dentate gyrus. Theoretical guarantees, including selection consistency, FDR control, and an error bound for post-selection clustering, are also established.
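The abstract does not spell out how the knockoff step selects genes, so the following is only a minimal sketch of the generic knockoff+ selection rule (Barber and Candès), not the authors' implementation. It assumes each gene `j` already has a signed statistic `w[j]`, where a large positive value favours the real gene over its knockoff copy. How such a statistic would be built from the MM-test is not stated here. The function names `knockoff_plus_threshold` and `knockoff_select` and the example values are hypothetical.

```python
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Smallest threshold t whose estimated false discovery proportion is <= q.

    The estimate is (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}).
    Returns infinity when no candidate threshold reaches the target level q.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of the statistics.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # genes that would be selected
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w: Sequence[float], q: float) -> List[int]:
    """Indices of the genes whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics for ten genes (illustrative values, not real data).
w_example = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.3, -0.2, 2.0, 1.5]
selected = knockoff_select(w_example, q=0.2)
```

Negative statistics act as a stand-in for false positives, which is why the rule counts them in the numerator. The "+1" makes the estimate conservative, and an infinite threshold simply means no gene is selected at the requested level.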

Paper Structure

This paper contains 15 sections, 3 theorems, 10 equations, 5 figures, 3 tables.

Key Result

Theorem 1

Assume that Conditions 1--2 hold. Let $0 < \varpi < 1/2 - \vartheta$ be a constant. When $n$ is sufficiently large, for $r_n^{\vartheta}(\log n)^2 < T_n < r_n^{1/2-\varpi} \log n$ and $t_n = r_n^{-\kappa}$ with $-1 + 3\vartheta < \kappa < 5/2$, we have $\mathbb{P}\bigl( S_1 \neq \hat{S}_{\ldots}$ ...

Figures (5)

  • Figure 1: (A) Experimental design of the 3D spatial transcriptomics dataset from Ortiz et al. (2020). (B) Violin plots of quality control metrics across samples. The top panel shows nFeature_RNA (detected genes per cell), and the bottom panel shows nCount_RNA (total UMI counts). (C) 3D spatial visualization of Ier5 expression, which is enriched in the isocortex in this dataset. Color intensity indicates expression level, with red corresponding to higher expression. (D) Visualization of Ier5 expression in a two-dimensional slide.
  • Figure 2: Benchmarking results of SVG identification methods across 34 real ST datasets. The y-axis shows the mean AUPRC, AUROC, and early precision (EP) of each method, calculated separately for each silver standard (Wilcoxon test and negative binomial regression). The last panel shows the mean ARI values for spatial clustering based on the SVGs identified by each method. Higher values indicate better performance.
  • Figure 3: (A) UMAP visualization of cell clusters identified by the proposed MM-test method. (B) Brain region annotations based on the Allen Brain Atlas. (C) Clustering results on slice 23B from various methods, using their selected SVGs across all 20 slices. The red circle highlights the DG region, which only MM-test clearly identified. The arrow points to the DG region. (D) Marker gene enrichment analysis: heatmaps showing expression patterns of region-specific marker genes for clusters identified by MM-test and Moran. Marker genes are obtained from the Allen Brain Atlas (Lein et al., 2007). Region abbreviations are as defined in (B).
  • Figure 4: 3D visualization of clusters identified by the MM-test across 20 consecutive slices. (A) All clusters identified by the MM-test. (B) 3D anatomical annotation of the CA and DG. (C) CAsp and DG regions (Clusters 10 and 17) identified by all methods. (D) CAsp (Cluster 10) and DG regions (Clusters 16 and 17) identified by MM-test. (E) Isocortex region identified by all methods. The arrow points to the misclassified spots from the CTXsp region (i.e., Moran's Cluster 1).
  • Figure 5: Number of marker genes of different regions among the top SVGs identified by different methods. Abbreviations are as defined in Figure 3(B).

Theorems & Definitions (4)

  • Example 1: Spatial transcriptomics data
  • Theorem 1
  • Theorem 2
  • Theorem 3