Uncertainty-aware t-distributed Stochastic Neighbor Embedding for Single-cell RNA-seq Data

Hui Ma, Kai Chen

TL;DR

This work introduces uncertainty-aware t-SNE (Ut-SNE), a noise-defending visualization tool tailored for uncertain single-cell RNA-seq data. Ut-SNE incorporates noise in transcriptomic variability into the visual interpretation of the data, revealing significant uncertainties that standard t-SNE does not show.

Abstract

Nonlinear data visualization using t-distributed stochastic neighbor embedding (t-SNE) represents complex single-cell transcriptomic landscapes in two or three dimensions so that biological populations can be depicted accurately. However, t-SNE often fails to account for uncertainties in the original dataset, leading to misleading visualizations in which noisy cell subsets appear indistinguishable. To address these challenges, we introduce uncertainty-aware t-SNE (Ut-SNE), a noise-defending visualization tool tailored for uncertain single-cell RNA-seq data. By creating a probabilistic representation for each sample, Ut-SNE accurately incorporates noise in transcriptomic variability into the visual interpretation of single-cell RNA sequencing data, revealing significant uncertainties in that variability. Through various examples, we showcase the practical value of Ut-SNE and underscore the significance of incorporating uncertainty awareness into data visualization practices. This versatile uncertainty-aware visualization tool can be easily adapted to other scientific domains beyond single-cell RNA sequencing, making it a valuable resource for high-dimensional data analysis.
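
To make the idea of a probabilistic representation for each sample concrete, the following is a minimal sketch and not the authors' formulation. It assumes each cell is modeled as an independent Gaussian with a mean expression vector and a diagonal covariance, and it computes the expected squared Euclidean distance between every pair of cells, which equals the squared distance between the means plus the two total variances. The function name `expected_sq_distances` and the diagonal-covariance assumption are illustrative choices that do not come from the paper.

```python
import numpy as np


def expected_sq_distances(means, variances):
    """Expected squared Euclidean distance between independent Gaussian samples.

    means:     (n, d) array, one mean expression vector per cell.
    variances: (n, d) array, per-gene variances (diagonal covariance is an
               assumption of this sketch, not something stated in the paper).

    For independent x_i ~ N(mu_i, diag(s_i)) and x_j ~ N(mu_j, diag(s_j)):
        E||x_i - x_j||^2 = ||mu_i - mu_j||^2 + sum(s_i) + sum(s_j)
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Squared distance between the means, shape (n, n).
    diff = means[:, None, :] - means[None, :, :]
    mean_term = np.sum(diff ** 2, axis=-1)

    # Total variance of each cell, added once for each member of the pair.
    total_var = variances.sum(axis=1)
    dist = mean_term + total_var[:, None] + total_var[None, :]

    # A sample has zero distance to itself.
    np.fill_diagonal(dist, 0.0)
    return dist
```

Under these assumptions, two cells with identical mean profiles but large variances end up farther apart than their means alone would suggest, which is one simple way for noise to enter the neighborhood structure that an embedding tries to preserve.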

Paper Structure

This paper contains 10 sections, 10 equations, 5 figures, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of Ut-SNE for uncertainty-aware visualization. Ut-SNE aims to represent uncertain high-dimensional scRNA-seq data in a 2D or 3D embedding space while maintaining its original uncertainty and structure. Standard t-SNE builds its embedding by constructing $k$-nearest neighbor graphs to summarize the data manifold. However, t-SNE computes deterministic distances within neighborhoods, which can overlook uncertainty in the original space, a key feature of the data. To address this, we introduce a general, uncertain, integrable measure of similarity. Ut-SNE begins with a cell data matrix (shown in purple) and a pairwise probabilistic matrix, from which it calculates the uncertain pairwise distance matrix that represents the high-dimensional joint distribution. This is then aligned with a low-dimensional joint distribution obtained by optimizing point coordinates to preserve local uncertain distances between neighbors.
  • Figure 2: Uncertainty-aware visualization represents the inherent uncertain structure of synthetic datasets more accurately than conventional methods. The point clouds of the synthetic dataset are sampled from a mixture of Gaussians in 20 dimensions. The visualization results include: (a) four points using standard t-SNE; (b) four points with different uncertainties (indicated by isolines) using Ut-SNE; (c) the point clouds using standard t-SNE; (d) the point clouds and their probability density $\frac{1}{n}\sum_{i=1}^{n}\mathcal{N}({\mathbf{y}}_{i}; {\mathbf{y}}_{i}^{*}, \mathrm{Var}({\mathbf{y}}_{i}))$ using Ut-SNE. Standard t-SNE cannot visualize uncertainty: its low-dimensional embeddings are deterministic and do not reflect each point's uncertainty in the original space. Ut-SNE captures the uncertainty in the data's structure and provides a probabilistic visualization of each point as well as of each cluster in the point cloud.
  • Figure 3: Visualization of the embedding of the breast cell dataset. We label clusters by cell type: B cell (magenta), TAM (red), T cell (gray-blue), CAF (cyan), and plasma cell (purple). The visualizations include: (a) standard t-SNE with random initialization; (b) Ut-SNE with random initialization; (c) standard t-SNE with PCA initialization; and (d) Ut-SNE with the same PCA initialization. Standard t-SNE often produces scattered visualizations in which the apparent size of a point cluster (distinguished by color) does not reflect the space it occupies in the original data, whereas Ut-SNE captures the true structure more accurately by incorporating uncertainty information.
  • Figure 4: Visualization of the distribution of the adipocyte cell dataset. In contrast to standard t-SNE, Ut-SNE provides a probabilistic representation of the low-dimensional embedding. In subplot (b) and the subsequent subplots, the uncertainty of the low-dimensional embedding is visualized by varying the intensity of the background pixels: the darker the pixel, the higher the precision of the mapping. Subplot (b) visualizes the embeddings of the preadipocytes. Every preadipocyte cell has an underlying distribution that can be projected as in subplot (a) (for simplicity, we show the sampled embeddings). Subplots (c) and (d) visualize the distributions of different cells.
  • Figure 5: Visualization of the embedding of the adipocyte cell dataset.
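
The probability density in the Figure 2(d) caption, $\frac{1}{n}\sum_{i=1}^{n}\mathcal{N}({\mathbf{y}}_{i}; {\mathbf{y}}_{i}^{*}, \mathrm{Var}({\mathbf{y}}_{i}))$, can be evaluated on a pixel grid to produce the kind of background shading that Figure 4 describes. The sketch below is only an illustration under stated assumptions: the embedding is 2D, each $\mathrm{Var}({\mathbf{y}}_{i})$ is treated as a diagonal covariance with one variance per axis, and the function name `mixture_density` is hypothetical rather than taken from the paper or any Ut-SNE code.

```python
import numpy as np


def mixture_density(grid_points, centers, variances):
    """Evaluate (1/n) * sum_i N(y; y_i*, Var(y_i)) at each query point.

    grid_points: (m, 2) query locations in the embedding plane.
    centers:     (n, 2) embedded positions y_i*.
    variances:   (n, 2) per-axis variances of each embedded point
                 (diagonal covariance is an assumption of this sketch).
    Returns an (m,) array of mixture density values.
    """
    grid_points = np.asarray(grid_points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Offsets from every query point to every center, shape (m, n, 2).
    diff = grid_points[:, None, :] - centers[None, :, :]

    # Gaussian exponent with per-axis variances, shape (m, n).
    exponent = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=-1)

    # Normalizing constant of each 2D Gaussian, shape (n,).
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.prod(variances, axis=1)))

    # Equal-weight average over the n components.
    return np.mean(norm[None, :] * np.exp(exponent), axis=1)
```

Reshaping the returned values to the grid's height and width gives an image that could be drawn behind the embedded points. A point with small variance contributes a narrow, high peak, so a darker-is-denser colormap would be consistent with the caption's statement that darker pixels indicate a more precise mapping.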