Kernel-Based Testing for Single-Cell Differential Analysis

Anthony Ozier-Lafontaine, Camille Fourneaux, Ghislain Durif, Polina Arsenteva, Céline Vallot, Olivier Gandrillon, Sandrine Giraud, Bertrand Michel, Franck Picard

TL;DR

Kernel-based testing addresses the challenge of differential analysis in single-cell data by comparing full cell-wise distributions rather than univariate means. The authors develop two complementary tests based on weighted kernel-mean embeddings and Kernel Fisher Discriminant Analysis, with bandwidth selection and a zero-inflated extension for scRNA-Seq. They validate the approach on simulations and multiple public scRNA-Seq datasets, showing calibrated type-I error, strong power to detect non-linear alternatives, and competitive performance with existing methods; they also apply it to scChIP-Seq to reveal epigenomic heterogeneity and persister-like subpopulations. The results demonstrate the method's ability to uncover subtle population heterogeneities and to identify candidate subpopulations and epigenomic states, offering a flexible framework for multi-group designs in single-cell analysis.

Abstract

Single-cell technologies offer insights into molecular feature distributions, but comparing them poses challenges. We propose a kernel-testing framework for non-linear cell-wise distribution comparison, analyzing gene expression and epigenomic modifications. Our method allows feature-wise and global transcriptome/epigenome comparisons, revealing cell population heterogeneities. Using a classifier based on embedding variability, we identify transitions in cell states, overcoming limitations of traditional single-cell analysis. Applied to single-cell ChIP-Seq data, our approach identifies untreated breast cancer cells with an epigenomic profile resembling persister cells. This demonstrates the effectiveness of kernel testing in uncovering subtle population variations that might be missed by other methods.
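
To make the idea of comparing full distributions rather than means concrete, here is a minimal sketch of a generic kernel two-sample test: a squared maximum mean discrepancy (MMD) statistic with a Gaussian kernel and a permutation p-value. It is an illustration only, not the authors' truncated Kernel Fisher Discriminant Analysis statistic or their software. The function names, the median-heuristic bandwidth, the permutation scheme, and the toy data are all assumptions introduced for this example.

```python
import numpy as np


def gaussian_gram(z, bandwidth):
    """Gaussian kernel matrix k(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 * bandwidth^2))."""
    sq_norms = np.sum(z ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def median_bandwidth(z):
    """Median heuristic (an assumption here): median pairwise Euclidean distance."""
    sq_norms = np.sum(z ** 2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T, 0.0)
    upper = np.triu_indices_from(sq_dists, k=1)
    return np.sqrt(np.median(sq_dists[upper]))


def mmd2_from_gram(gram, idx_a, idx_b):
    """Biased (V-statistic) squared MMD between the two groups indexed by idx_a and idx_b."""
    k_aa = gram[np.ix_(idx_a, idx_a)].mean()
    k_bb = gram[np.ix_(idx_b, idx_b)].mean()
    k_ab = gram[np.ix_(idx_a, idx_b)].mean()
    return k_aa + k_bb - 2.0 * k_ab


def mmd_permutation_test(x, y, n_perm=500, seed=0):
    """Return the observed squared MMD and a permutation p-value."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    gram = gaussian_gram(pooled, median_bandwidth(pooled))
    n_x, n_total = len(x), len(pooled)
    observed = mmd2_from_gram(gram, np.arange(n_x), np.arange(n_x, n_total))
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(n_total)  # shuffle group labels under H0
        if mmd2_from_gram(gram, perm[:n_x], perm[n_x:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy data (hypothetical): same mean, different shape, one feature, 100 cells per group.
rng = np.random.default_rng(1)
unimodal = rng.normal(0.0, 1.0, size=(100, 1))
bimodal = np.concatenate([
    rng.normal(-2.0, 0.5, size=(50, 1)),
    rng.normal(2.0, 0.5, size=(50, 1)),
])
stat, p_value = mmd_permutation_test(unimodal, bimodal)
print(f"MMD^2 = {stat:.4f}, permutation p-value = {p_value:.4f}")
```

Both toy groups are centred at zero, so a mean-based comparison has little to detect, whereas the kernel statistic is sensitive to the difference in shape. The Gram matrix is computed once on the pooled sample and re-indexed for each permutation, which keeps the loop cheap.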

Paper Structure (29 sections, 25 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Top: Examples of distributions of the simulated data. DE: classical difference in expression; DM: difference in modalities; DP: difference in proportions; DB: difference in both modalities and proportions with equal means. Bottom: Projection of cells on the discriminant axis ($T=4$) for each alternative. The non-linear transform allows the separation of distributions on the discriminant axis.
  • Figure 2: Comparison of DEA methods with respect to type-I errors and power. Top: Type-I errors are computed on raw $p$-values under $H_0$. The False Discovery Rate is computed on Benjamini-Hochberg adjusted $p$-values. Power is computed on raw $p$-values under $H_1$. The True Discovery Rate is computed on Benjamini-Hochberg adjusted $p$-values. Simulated data consist of $100$ cells and $10000$ genes ($1000$ DE, $9000$ non-DE). Alternatives are simulated using DE: classical difference in expression ($250$ genes); DM: difference in modalities ($250$ genes); DP: difference in proportions ($250$ genes); DB: difference in both modalities and proportions with equal means ($250$ genes). Error rates are computed over $500$ replicates. The truncation parameter is set to $T=4$ for the Gaussian kernel.
  • Figure 3: Top: Hierarchical clustering based on average AUCC scores computed between pairs of methods (over 18 datasets \cite{squair_confronting_2021}). Bottom: Boxplots of the average expression (left) and proportion of zeros (right) of the top 500 DE genes for different DE methods (over 18 datasets \cite{squair_confronting_2021}). Red: bulk methods; orange: pseudobulk methods; blue: single-cell methods. The truncation parameter is set to $T=4$ for ktest (only univariate tests were performed).
  • Figure 4: a: Summarized distance graphs between conditions before (left) and after (right) splitting condition 48HREV into populations 48HREV-1 and 48HREV-2. b: Cell densities of all compared conditions, before (left) and after (right) splitting condition 48HREV. c: Cell densities of compared conditions projected on the discriminant axis between conditions 48HREV and 48HDIFF (left), 48HREV and 0H (middle), and 48HREV and 24H (right), with population 48HREV-1 highlighted. d: Boxplots of the variation of gene expression across the five populations 0H, 24H, 48HDIFF, 48HREV-1, and 48HREV-2 for the three gene clusters. Panels a, b, c, and d are obtained from scRT-qPCR data. The multivariate differential expression analysis was performed with $T=10$.
  • Figure 5: Differential analysis of scChIP-Seq data on breast cancer cells. a: Cell densities of persister cells vs. untreated cells. Sub-populations of untreated cells were identified using a 3-component mixture model, which revealed persister-like, intermediate, and naive cells. b-c-d: Violin plots of the top-10 differentially enriched H3K27me3 loci between the 3 sub-populations. Features are designated by the genomic coordinates of the ChIP-Seq peaks. Corresponding overlapping genes are provided in Table \ref{tab:chip}. Multivariate (a) and univariate (b-c-d) analyses were performed with $T=5$.
  • ...and 7 more figures