Online t-SNE for single-cell RNA-seq

Hui Ma; Kai Chen

TL;DR

Online t-SNE enables continual discovery of new structures and high-quality visualization of new scRNA-seq data without retraining from scratch, as demonstrated across diverse sequential scRNA-seq datasets.

Abstract

Because samples arrive sequentially, experimental conditions change, and knowledge evolves, it has become essential to continually visualize the evolving structures of sequential and diverse single-cell RNA-sequencing (scRNA-seq) data. However, t-distributed stochastic neighbor embedding (t-SNE), one of the state-of-the-art visualization and analysis methods for scRNA-seq, only visualizes static scRNA-seq data offline and does not meet this need. To address this limitation, we introduce online t-SNE, which integrates sequential scRNA-seq data. Online t-SNE does so by leveraging the embedding space of old samples, exploring the embedding space of new samples, and aligning the two embedding spaces on the fly. Consequently, online t-SNE enables the continual discovery of new structures and high-quality visualization of new scRNA-seq data without retraining from scratch. We demonstrate the visualization capabilities of online t-SNE across diverse sequential scRNA-seq datasets.
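The idea of reusing the old embedding while placing new samples can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm, and everything in it is an assumption made for illustration: a single joint distribution over all old and new points, one shared Gaussian bandwidth `sigma` instead of a perplexity search, a PCA projection standing in for the offline t-SNE embedding of the old data, and plain gradient descent with step halving. The function names are hypothetical. Only the new coordinates are updated; the old embedding `Y` stays frozen.

```python
import numpy as np


def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    sq = np.sum(A * A, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)


def gaussian_joint(X_all, sigma=1.0):
    """Symmetric Gaussian joint probabilities p_ij.
    Assumption: one shared bandwidth, no per-point perplexity search."""
    W = np.exp(-pairwise_sq_dists(X_all) / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()


def student_t_joint(Y_all):
    """Student-t joint probabilities q_ij and the unnormalized kernel w_ij."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y_all))
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all pairs; eps guards against log(0)."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))


def embed_new_points(X, Y, X_new, n_iter=300, step=10.0, seed=0):
    """Place X_new in the existing embedding Y; Y itself is never modified."""
    rng = np.random.default_rng(seed)
    n_old = X.shape[0]
    P = gaussian_joint(np.vstack([X, X_new]))
    # Start the new points near the centre of the old embedding.
    Y_new = Y.mean(axis=0) + 1e-2 * rng.standard_normal((X_new.shape[0], Y.shape[1]))
    Q, W = student_t_joint(np.vstack([Y, Y_new]))
    cost = kl_divergence(P, Q)
    history = [cost]
    for _ in range(n_iter):
        Y_all = np.vstack([Y, Y_new])
        # Standard t-SNE gradient, evaluated for the rows of the new points only.
        G = ((P - Q) * W)[n_old:]
        grad = 4.0 * (G.sum(axis=1, keepdims=True) * Y_new - G @ Y_all)
        candidate = Y_new - step * grad
        Q_c, W_c = student_t_joint(np.vstack([Y, candidate]))
        cost_c = kl_divergence(P, Q_c)
        if cost_c < cost:  # accept only steps that lower the KL cost
            Y_new, Q, W, cost = candidate, Q_c, W_c, cost_c
            step *= 1.1
        else:              # otherwise shrink the step and retry
            step *= 0.5
        history.append(cost)
    return Y_new, history


# Toy data: three Gaussian clusters in 5 dimensions (synthetic, for illustration).
rng = np.random.default_rng(42)
centers = 4.0 * rng.standard_normal((3, 5))
X = np.vstack([c + 0.5 * rng.standard_normal((15, 5)) for c in centers])
X_new = np.vstack([c + 0.5 * rng.standard_normal((5, 5)) for c in centers])

# Stand-in for an offline t-SNE embedding of X: a 2-D PCA projection.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:2].T
Y_before = Y.copy()

Y_new, history = embed_new_points(X, Y, X_new)
```

Because candidate steps are accepted only when they lower the cost, `history` is non-increasing by construction. A practical implementation would more likely use momentum, early exaggeration, and perplexity-calibrated bandwidths, none of which are shown here.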

Paper Structure (17 sections, 13 equations, 8 figures, 2 algorithms)

This paper contains 17 sections, 13 equations, 8 figures, 2 algorithms.

Introduction
Theory
t-distributed stochastic neighbor embedding
Online t-distributed stochastic neighbor embedding
Compositional high-dimensional probabilities using Gaussian distribution
Compositional low-dimensional probabilities using Student-$t$ distribution
Online Kullback-Leibler divergence aligning high and low dimensional probabilities
Algorithm of online t-SNE
Computation complexity of online t-SNE
Experiments
Overview of online t-SNE
Offline t-SNE struggles to visualize synthetic sequential dataset
Learning consistent embedding of mouse neocortex cell dataset
Mitigating the batch effect of kidney cell dataset
Exploring shared embedding across diversified pancreatic cell dataset
...and 2 more sections

Figures (8)

Figure 1: Framework of online t-SNE. Online t-SNE uses three joint probabilities in the high-dimensional data space: $p_{ij}$ (in red) of old data $X\in\mathbb{R}^{D}$, $p_{i_{*}j}$ (in orange) between old data and new data $X_{*}\in\mathbb{R}^{D}$, and $p_{i_{*}j_{*}}$ (in blue) of new data. Correspondingly, three joint probabilities in the low-dimensional embedding space, $q_{ij}$, $q_{i_{*}j}$, and $q_{i_{*}j_{*}}$, approximate $p_{ij}$, $p_{i_{*}j}$, and $p_{i_{*}j_{*}}$, respectively. $\{C_{i}\}_{i=1}^{3}$ denote the online KL divergence costs between the probabilities of the high-dimensional data and the probabilities of the low-dimensional embeddings. On the left are two subsets of scRNA-seq data (green matrices): the old data subset $X$ and the new data subset $X_{*}$. On the right are the two low-dimensional embeddings learned by online t-SNE, $Y\in\mathbb{R}^{2}$ and $Y_{*}\in\mathbb{R}^{2}$, for $X$ and $X_{*}$, respectively. The black and purple arrows represent the knowledge flow from the high-dimensional old data and new data, respectively, to their low-dimensional embeddings. Green, blue, and magenta points in $Y$ and $Y_{*}$ denote the low-dimensional embeddings of scRNA-seq samples. Offline t-SNE (within the dashed rectangle) is a special case of online t-SNE that focuses solely on $p_{ij}$ and $q_{ij}$.
Figure 2: Learning process of online t-SNE. Given the low-dimensional embeddings $Y$ of old data, online t-SNE iteratively visualizes new data $X_{*}$ in real time and avoids time-consuming retraining on the set $\{X, X_{*}\}$. Offline t-SNE (within the dashed rectangle) learns the low-dimensional embeddings $Y$ of old data $X$. On the right, online t-SNE incorporates the old data and its embeddings to learn the low-dimensional embeddings $Y_{*}$ of new data $X_{*}$. The blue and purple arrows denote the data processing direction of offline t-SNE and online t-SNE, respectively. In particular, one purple arrow (middle, top) signifies the interaction between old data and new data used to compute the high-dimensional joint distribution $p_{i_{*} j}$; another purple arrow (middle, bottom) signifies the interaction between the embeddings of old data and new data used to compute the low-dimensional joint distribution $q_{i_{*} j}$.
Figure 3: Visualization of synthetic sequential data: (a) visualization of the old data subset ($n=700$) using offline t-SNE, (b) visualization of the new data subset ($n=300$) using offline t-SNE with random initialization, (c) visualization of the new data subset using offline t-SNE with PCA initialization. The shaded points in subplots (b) and (c) denote the embeddings of the old data. The 10 clusters are labeled with 10 different colors. In both subplots (b) and (c), offline t-SNE fails to align the embeddings of the new data with the embeddings of the old data in subplot (a).
Figure 4: Visualization of the adult mouse neocortex cell dataset: (a) visualization using offline t-SNE on the combined old and new data, (b) visualization using online t-SNE on the new data. The shaded points in subplot (b) denote the embeddings learned by offline t-SNE on the old data. Cell types are colored as follows: Pvalb Gpr149 Islr, light blue; CR Lhx5, dark blue; L2/3 IT ALM Macc1 Lrg1, light green; L5 IT ALM Lypd1 Gpr88, dark green; L5 IT VISp Whrn Tox2, pink; Lamp5 Plch2 Dock5, red; L6 IT ALM Oprk1, orange; L2/3 IT VISp Adamts2, orange-yellow; L5 IT ALM Npw, light purple; Sst Myh8 Fibin, dark purple; Pvalb Reln Itm2a, light yellow; Lamp5 Ntn1 Npy2r, brown; and L5 IT ALM Pld5, turquoise. Online t-SNE on the new data can leverage the learned embeddings of the old data without retraining from scratch.
Figure 5: Visualization using online t-SNE on the renal cell dataset with two batches: (a) visualization of the first batch, (b) visualization of the second batch. Cell types are colored as follows: PTC, light blue; LoH.TAL, dark blue; CD4.T.cells, light green; IC.A, dark green; Macro., pink; DC, red; LoH.DTL, orange; vSMC, orange-yellow; EC.glom, light purple; LoH.ATL, dark purple; and PC.CD, light yellow.
...and 3 more figures
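
The three costs named in the Figure 1 caption can be read as block-wise KL divergences over the old-old, new-old, and new-new pairs. The form below is one plausible reading, not a quotation of the paper's equations: the index ranges, the normalization of each block, and any weighting between the terms are assumptions, since the captions do not state them.

$$
C_{1}=\sum_{i\neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}},\qquad
C_{2}=\sum_{i_{*},\,j} p_{i_{*}j}\log\frac{p_{i_{*}j}}{q_{i_{*}j}},\qquad
C_{3}=\sum_{i_{*}\neq j_{*}} p_{i_{*}j_{*}}\log\frac{p_{i_{*}j_{*}}}{q_{i_{*}j_{*}}}
$$

Under this reading, keeping only $C_{1}$ recovers the usual offline t-SNE objective, which is consistent with the caption's statement that offline t-SNE focuses solely on $p_{ij}$ and $q_{ij}$.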
