
Online t-SNE for single-cell RNA-seq

Hui Ma, Kai Chen

TL;DR

Online t-SNE enables the continual discovery of new structures and high-quality visualization of new scRNA-seq data without retraining from scratch. Its visualization capabilities are demonstrated across diverse sequential scRNA-seq datasets.

Abstract

Because samples arrive sequentially, experimental conditions change, and knowledge evolves, there is a growing need to continually visualize the evolving structures of sequential and diverse single-cell RNA-sequencing (scRNA-seq) data. However, t-distributed stochastic neighbor embedding (t-SNE), one of the state-of-the-art visualization and analysis methods for scRNA-seq, visualizes only static scRNA-seq data offline and does not meet this need. To address this challenge, we introduce online t-SNE, which seamlessly integrates sequential scRNA-seq data. Online t-SNE achieves this by leveraging the embedding space of old samples, exploring the embedding space of new samples, and aligning the two embedding spaces on the fly. Consequently, online t-SNE enables the continual discovery of new structures and high-quality visualization of new scRNA-seq data without retraining from scratch. We showcase the visualization capabilities of online t-SNE across diverse sequential scRNA-seq datasets.
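For reference, the standard (offline) t-SNE objective that online t-SNE builds on can be written as below. This is the textbook formulation, not a transcription of the paper's own equations; $\sigma_i$ is the per-point bandwidth set by the perplexity and $n$ is the number of samples.

```latex
\begin{aligned}
p_{j|i} &= \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
                {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
&\qquad
p_{ij} &= \frac{p_{j|i} + p_{i|j}}{2n}, \\[4pt]
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
               {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}},
&\qquad
C &= \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
\end{aligned}
```

Read against this objective, the three steps in the abstract plausibly correspond to keeping the terms among old samples fixed, adding analogous terms among new samples, and adding cross terms between new and old samples. The exact form of those terms is defined in the body of the paper and is not reproduced here.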

Paper Structure (17 sections, 13 equations, 8 figures, 2 algorithms)


Figures (8)

  • Figure 1: Framework of online t-SNE. Online t-SNE uses three joint probabilities in the high-dimensional data space: $p_{ij}$ (in red) among old data $X\in\mathbb{R}^{D}$, $p_{i_{*}j}$ (in orange) between old data and new data $X_{*}\in\mathbb{R}^{D}$, and $p_{i_{*}j_{*}}$ (in blue) among new data. Correspondingly, three joint probabilities in the low-dimensional embedding space, $q_{ij}$, $q_{i_{*}j}$, and $q_{i_{*}j_{*}}$, approximate $p_{ij}$, $p_{i_{*}j}$, and $p_{i_{*}j_{*}}$, respectively. $\{C_{i}\}_{i=1}^{3}$ denote the online KL divergence costs between the probabilities of the high-dimensional data and the probabilities of the low-dimensional embeddings. On the left are two subsets of scRNA-seq data (green matrices): the old data subset $X$ and the new data subset $X_{*}$. On the right are the two low-dimensional embedding spaces learned by online t-SNE, $Y\in\mathbb{R}^{2}$ and $Y_{*}\in\mathbb{R}^{2}$, for $X$ and $X_{*}$, respectively. The black and purple arrows represent the knowledge flow from the high-dimensional old data and new data, respectively, to their low-dimensional embeddings. Green, blue, and magenta points in $Y$ and $Y_{*}$ denote the low-dimensional embeddings of scRNA-seq samples. Offline t-SNE (within a dashed rectangle) is a special case of online t-SNE that focuses solely on $p_{ij}$ and $q_{ij}$.
  • Figure 2: Learning process of online t-SNE. Given the low-dimensional embeddings $Y$ of old data, online t-SNE iteratively visualizes new data $X_{*}$ in real time and avoids time-consuming retraining on the set $\{X, X_{*}\}$. Offline t-SNE (within a dashed rectangle) learns the low-dimensional embeddings $Y$ of old data $X$. On the right, online t-SNE incorporates the old data and its embeddings to learn the low-dimensional embeddings $Y_{*}$ of new data $X_{*}$. The blue and purple arrows denote the data processing direction of offline t-SNE and online t-SNE, respectively. In particular, one purple arrow (middle, top) signifies the data interaction between old data and new data used to compute the high-dimensional joint distribution $p_{i_{*} j}$. Another purple arrow (middle, bottom) signifies the embedding interaction between the embeddings of old data and new data used to compute the low-dimensional joint distribution $q_{i_{*} j}$.
  • Figure 3: Visualization of synthetic sequential data: (a) visualization of the old data subset ($n=700$) using offline t-SNE, (b) visualization of the new data subset ($n=300$) using offline t-SNE with random initialization, (c) visualization of the new data subset using offline t-SNE with PCA initialization. The shaded points in subplots (b) and (c) denote the embeddings of the old data. The 10 colors label the 10 different clusters. In both subplots (b) and (c), offline t-SNE cannot align the embeddings of the new data with the embeddings of the old data in subplot (a).
  • Figure 4: Visualization of the adult mouse neocortex cell dataset: (a) visualization using offline t-SNE on the collection of old and new data, (b) visualization using online t-SNE on the new data. The shaded points in subplot (b) denote the embeddings learned by offline t-SNE on the old data. Cell types are colored as follows: Pvalb Gpr149 Islr, light blue; CR Lhx5, dark blue; L2/3 IT ALM Macc1 Lrg1, light green; L5 IT ALM Lypd1 Gpr88, dark green; L5 IT VISp Whrn Tox2, pink; Lamp5 Plch2 Dock5, red; L6 IT ALM Oprk1, orange; L2/3 IT VISp Adamts2, orange-yellow; L5 IT ALM Npw, light purple; Sst Myh8 Fibin, dark purple; Pvalb Reln Itm2a, light yellow; Lamp5 Ntn1 Npy2r, brown; and L5 IT ALM Pld5, turquoise. Online t-SNE on the new data can leverage the learned embeddings of the old data without retraining from scratch.
  • Figure 5: Visualization using online t-SNE on a renal cell dataset with two batches: (a) visualization of the first batch, (b) visualization of the second batch. Cell types are colored as follows: PTC, light blue; LoH.TAL, dark blue; CD4.T.cells, light green; IC.A, dark green; Macro., pink; DC, red; LoH.DTL, orange; vSMC, orange-yellow; EC.glom, light purple; LoH.ATL, dark purple; and PC.CD, light yellow.
  • ...and 3 more figures
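
The three low-dimensional probability blocks named in the Figure 1 caption ($q_{ij}$, $q_{i_{*}j}$, $q_{i_{*}j_{*}}$) can be illustrated with a minimal NumPy sketch. Everything beyond the Student-t kernel of standard t-SNE is an assumption made for illustration: the helper names (`student_t_kernel`, `low_dim_blocks`, `kl_cost`) are hypothetical, the blocks are normalized jointly over all old and new points, and the high-dimensional blocks are taken as given. The paper's actual normalization and weighting of $C_1$, $C_2$, $C_3$ may differ.

```python
import numpy as np


def student_t_kernel(A, B):
    """Unnormalized Student-t similarities (1 + ||a - b||^2)^-1 for rows of A and B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return 1.0 / (1.0 + np.maximum(sq, 0.0))  # clip tiny negatives from round-off


def low_dim_blocks(Y_old, Y_new):
    """Return (q_old_old, q_new_old, q_new_new).

    Assumption: one joint normalization over all ordered pairs of old and new
    points, so the cross block is counted twice (new->old and old->new).
    """
    w_oo = student_t_kernel(Y_old, Y_old)
    np.fill_diagonal(w_oo, 0.0)  # no self-similarity
    w_no = student_t_kernel(Y_new, Y_old)
    w_nn = student_t_kernel(Y_new, Y_new)
    np.fill_diagonal(w_nn, 0.0)
    Z = w_oo.sum() + 2.0 * w_no.sum() + w_nn.sum()
    return w_oo / Z, w_no / Z, w_nn / Z


def kl_cost(P, Q, eps=1e-12):
    """Block-wise KL-style cost: sum of p * log(p / q) over entries with p > 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y_old = rng.normal(size=(6, 2))       # embeddings of old data, kept fixed
    Y_target = rng.normal(size=(4, 2))    # stand-in layout that defines the "p" blocks
    Y_new = rng.normal(size=(4, 2))       # current guess for the new embeddings

    # Stand-in for the high-dimensional blocks (assumption: taken as given).
    p_oo, p_no, p_nn = low_dim_blocks(Y_old, Y_target)
    q_oo, q_no, q_nn = low_dim_blocks(Y_old, Y_new)

    c1 = kl_cost(p_oo, q_oo)  # old-old term
    c2 = kl_cost(p_no, q_no)  # new-old (alignment) term
    c3 = kl_cost(p_nn, q_nn)  # new-new term
    print("C1, C2, C3:", c1, c2, c3)
    print("joint KL (cross block counted twice):", c1 + 2.0 * c2 + c3)
```

In this sketch only `Y_new` would be updated during optimization, while `Y_old` stays fixed; that mirrors the caption's description of reusing the old embeddings but is not taken from the paper's algorithm listing. Individual block costs can be negative because a single block is not a full probability distribution; only the joint sum is a proper KL divergence.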