Is your data alignable? Principled and interpretable alignability testing and integration of single-cell data

Rong Ma; Eric D. Sun; David Donoho; James Zou

TL;DR

A spectral manifold alignment and inference (SMAI) framework is presented, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data and improves various downstream analyses, providing further biological insights.

Abstract

Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference, and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.

Paper Structure (16 sections, 3 theorems, 28 equations, 21 figures, 2 tables, 2 algorithms)

This paper contains 16 sections, 3 theorems, 28 equations, 21 figures, 2 tables, 2 algorithms.

Supplementary Notes
Advantages of SMAI-test as suggested by statistical theory
SMAI-align algorithm: technical details
Rationale and geometric interpretation of SMAI-align
Simulation I: empirical consistency and statistical validity of SMAI
Simulation II: consistency and type I errors
Determine the rank parameter $r_{\max}$
Unequal sample sizes
Partial alignability testing and partial alignment
Simulation III: DE gene detection
Computing time
Nonlinear extensions of SMAI
Synthetic data
Implementation details
Proof of Theorem 1
...and 1 more sections

Key Result

Theorem 1

Suppose $\bar{\bold{X}}$ and $\bar{\bold{Y}}$ are independent and satisfy the above high-dimensional generalized spiked population model. Let $p(\bar{\bold{X}},\bar{\bold{Y}},r_{\max}, \beta)$ be the p-value returned by SMAI-test with $(r_{\max},b)=(r,1/\beta)$. Under the null hypothesis $H_0$, for

Figures (21)

Figure 1: Overview and illustration of the SMAI algorithm. (a) SMAI-test imposes a low-rank spiked covariance matrix model where the low-dimensional signal structures of data matrices are encoded by the few largest eigenvalues of the population covariance matrices. Under the null hypothesis that the underlying signal structures are alignable up to a similarity transformation, a test statistic based on comparing the leading eigenvalues of the empirical covariance matrices is computed, whose theoretical null distribution as $(n,d)\to\infty$ is derived using random matrix theory. The final p-value returned by SMAI-test is used to infer the alignability of the two datasets. (b) SMAI-align aims to solve the shuffled Procrustes optimization problem. To do so, SMAI-align starts with a denoising procedure, and then adopts an iterative spectral algorithm to achieve similarity matching between the two datasets using high-dimensional Procrustes analysis. The method returns an integrated dataset containing all the samples with the original features, along with a closed-form alignment function, which is interpretable and can be readily used for various downstream analyses.
Figure 2: Forcing uncertified data integration may cause false alignment, serious distortions and misleading inferences. (a) UMAP visualizations of the original (pooled) data under negative control task Neg1, and the integrated data as obtained by five popular methods (Scanorama, Harmony, LIGER, fastMNN and Seurat). For each method, the top figure is colored to indicate the distinct datasets being aligned, whereas the bottom figure is colored to indicate different cell types. See Figures S3a and S3b for similar results for Pamona and SCOT, and for the results on Neg2 and Neg3. (b) Under the three negative control tasks, we show barplots of Kendall's tau correlations between relative distances among the cells before integration and the distances after data integration, as achieved by each method. The red dashed line benchmarks the average Kendall's tau correlation of 0.9 achieved by SMAI-align over the positive control tasks Pos1-Pos7. (c) Boxplots of Jaccard similarity between the set of differentially expressed (DE) genes associated with a distinct cell type detected based on the integrated data and the DE genes based on the original data. Each point represents a cell type. See Figures S3c and S3d for similar results for Pamona and SCOT. (d) For Task Neg1, we show some representative barplots of one minus the false discovery proportion ($1-$FDP) and the power of detecting DE genes for some cell types based on the integrated data. Harmony is not included in (c) and (d) as its integration is only achieved in the low-dimensional space. Notably, SMAI-test correctly detects that all the datasets in Tasks Neg1-Neg3 are not alignable.
Figure 3: Performance of SMAI-align on the six positive control integration tasks. (a) Compared with the six existing algorithms (black), SMAI-align (red) has the best overall performance in preserving the within-data structures after integration while achieving a competitive performance in removing the unwanted variations. The former is characterized by the highest Kendall's tau correlations between the relative distances of the cells within a dataset before integration and the distances after integration (y-axis), whereas the latter is reflected by higher values of the batch-associated Davies-Bouldin (D-B) index (x-axis shown in log-scale). See Figure S5 for more comparisons on additional datasets and metrics. (b) UMAP visualizations of the integrated data as obtained by SMAI-align. For each integration task, the top figure is colored to indicate the two datasets being aligned, whereas the bottom figure is colored to indicate different cell types. See also Figure S7 for UMAP visualizations associated with other integration methods.
Figure 4: SMAI improves reliability and power of DE analysis. (a) Boxplots of Jaccard similarity between the DE genes for each cell type identified based on the integrated data, obtained by one of the six integration methods, and the genes identified based on the individual datasets before integration. Each point represents a distinct cell type. The results indicate that SMAI-align often leads to more consistent and more reliable characterization of DE genes, as compared with other methods. (b) Boxplots of log-expression levels of ITGB1 as grouped by cell types in the two datasets of human PBMCs (Task Pos3: Data 1 contains 219 natural killer (NK) cells and 3143 other cells, and Data 2 contains 194 NK cells and 3028 other cells), and in the integrated datasets (413 NK cells and 6171 other cells) as produced by SMAI-align, Seurat, fastMNN, and Scanorama. The DE pattern of ITGB1 is only preserved by SMAI after integration. (c) Boxplots of log-expression levels of FOLR3 as grouped by cell types in the two datasets of human lung tissues (Task Pos5: Data 1 contains 68 macrophages and 2285 other cells, and Data 2 contains 911 macrophages and 1000 other cells), and in the integrated datasets (979 macrophages and 3285 other cells) as produced by SMAI-align, Seurat, fastMNN, and Scanorama. Artificial DE patterns are created by existing integration methods. The stars above the boxplots indicate statistical significance of the DE test. Specifically, * means adjusted p-value $<0.05$; ** means adjusted p-value $<0.01$; *** means adjusted p-value $<0.001$. Harmony and LIGER are not included in (b) and (c) as they do not produce gene-specific integrated data.
Figure 5: SMAI improves prediction of single-cell spatial transcriptomic data. (a) Boxplots of Kendall's tau correlation between the actual expression levels of the spatial genes and their predicted values based on the two-step procedure (alignment followed by $k$ nearest neighbor regression) where the data alignment is achieved by LIGER, Scanorama, Harmony, Seurat, fastMNN, or SMAI-align. Each point represents a distinct spatial gene. The methods are ordered according to their median predictive performance, showing the overall best performance of SMAI. (b) Examples of true expression levels of some spatial genes from Task PosS3, presented according to the cells' spatial layout, and their predicted values based on SMAI-align. The colors are in log scale.
...and 16 more figures
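
The captions of Figures 1(b), 2(b) and 3(a) refer to alignment by a similarity transformation and to Kendall's tau between cell-to-cell distances before and after integration. The following Python sketch illustrates both ideas on toy data. It is not the SMAI-align algorithm: SMAI-align addresses a shuffled Procrustes problem in which cell correspondences are unknown, whereas this sketch assumes that row i of one matrix corresponds to row i of the other and that the data are noise-free. The function names (`fit_similarity`, `pairwise_distances`, `kendall_tau`) are hypothetical, and the tau computed here is the simple tau-a variant, which may differ from the variant used in the paper.

```python
import numpy as np

def fit_similarity(X, Y):
    """Least-squares scale, rotation and shift mapping rows of Y onto rows of X.

    Assumption for illustration only: row i of X corresponds to row i of Y.
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    U, s, Vt = np.linalg.svd(Yc.T @ Xc)
    R = U @ Vt                           # orthogonal Procrustes solution
    beta = s.sum() / (Yc ** 2).sum()     # optimal scaling given R
    shift = mu_x - beta * mu_y @ R       # matches the two centroids
    return beta, R, shift

def pairwise_distances(Z):
    """Euclidean distances between all unordered pairs of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return D[np.triu_indices(len(Z), k=1)]

def kendall_tau(a, b):
    """Kendall's tau-a by comparing signs over all pairs (O(m^2) memory)."""
    sa = np.sign(a[:, None] - a[None, :])
    sb = np.sign(b[:, None] - b[None, :])
    m = len(a)
    return (sa * sb).sum() / (m * (m - 1))

# Toy data: Y is a scaled, rotated and shifted copy of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Q = np.linalg.qr(rng.normal(size=(5, 5)))[0]   # random orthogonal matrix
Y = 2.5 * X @ Q + 1.0

beta_hat, R_hat, shift_hat = fit_similarity(X, Y)
aligned = beta_hat * Y @ R_hat + shift_hat      # closed-form alignment of Y

# A similarity transformation rescales all distances by the same factor,
# so the rank order of within-dataset distances is unchanged.
tau = kendall_tau(pairwise_distances(Y), pairwise_distances(aligned))
```

Because the fitted map is a single scale, rotation and shift, it can be written down and inspected directly, which is the sense in which a closed-form alignment function is interpretable. Methods that warp the data nonlinearly carry no such guarantee on the distance ranks.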

Theorems & Definitions (3)

Theorem 1
Theorem A.1
Lemma A.2
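
The Figure 1(a) caption states that SMAI-test compares the leading eigenvalues of the empirical covariance matrices under the null hypothesis of alignability up to a similarity transformation. The Python sketch below illustrates only the noise-free intuition behind that comparison: if one dataset is a scaled, rotated and shifted copy of the other, the leading covariance eigenvalues differ by a common factor equal to the squared scale. It does not compute the SMAI-test statistic or its p-value, which rely on random matrix theory to handle high-dimensional noise. The helper name `leading_cov_eigenvalues` and all simulation settings are assumptions made for illustration.

```python
import numpy as np

def leading_cov_eigenvalues(Z, r):
    """Top-r eigenvalues of the sample covariance of Z (rows are cells)."""
    cov = np.cov(Z, rowvar=False)
    return np.linalg.eigvalsh(cov)[::-1][:r]   # eigvalsh returns ascending order

rng = np.random.default_rng(1)
n, d, r = 200, 20, 3

# Low-rank signal: r latent factors with distinct strengths in d features.
latent = rng.normal(size=(n, r)) * np.array([5.0, 3.0, 2.0])
loadings = np.linalg.qr(rng.normal(size=(d, r)))[0]
X = latent @ loadings.T

# Y is a similarity-transformed copy of X (scale beta, rotation Q, shift 0.5).
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]
beta = 1.7
Y = beta * X @ Q + 0.5

# Under this noise-free construction every ratio equals beta ** 2.
ratios = leading_cov_eigenvalues(Y, r) / leading_cov_eigenvalues(X, r)
```

With real data the two matrices contain different cells and noise, so the ratios are only approximately constant, and judging whether the deviation is significant is what requires a derived null distribution.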
