Assessing and improving reliability of neighbor embedding methods: a map-continuity perspective

Zhexuan Liu, Rong Ma, Yiqiao Zhong

TL;DR

This work tackles the reliability of neighbor-embedding visualizations (e.g., t-SNE, UMAP) by recasting the embedding as a data-independent map defined over the entire input space, obtained via leave-one-out (LOO) analysis. It introduces the LOO-map, two forms of map discontinuity (overconfidence-inducing, OI, and fracture-inducing, FI), and two label-free point-wise scores (perturbation and singularity) that diagnose global and local distortions and guide hyperparameter tuning. Empirical validation on synthetic and real datasets supports the LOO assumption and shows that the proposed scores detect topology changes and inform perplexity choices, improving interpretability in biomedical and computer-vision contexts. The approach yields a practical, wrapper-based toolkit (with an accompanying R package) for producing more faithful visualizations of high-dimensional data and more robust hyperparameter selection.

Abstract

Visualizing high-dimensional data is essential for understanding biomedical data and deep learning models. Neighbor embedding methods, such as t-SNE and UMAP, are widely used but can introduce misleading visual artifacts. We find that the manifold learning interpretations from many prior works are inaccurate and that the misuse stems from a lack of data-independent notions of embedding maps, which project high-dimensional data into a lower-dimensional space. Leveraging the leave-one-out principle, we introduce LOO-map, a framework that extends embedding maps beyond discrete points to the entire input space. We identify two forms of map discontinuity that distort visualizations: one exaggerates cluster separation and the other creates spurious local structures. As a remedy, we develop two types of point-wise diagnostic scores to detect unreliable embedding points and improve hyperparameter selection, which are validated on datasets from computer vision and single-cell omics.

Paper Structure

This paper contains 38 sections, 1 theorem, 56 equations, 20 figures, 8 tables.

Key Result

Theorem 1

Consider the LOO loss function for t-SNE given in Equations def:tsne-w and def:loo. Under the assumptions stated above, the negative gradient of the loss is where ${\mathbf{y}}_{\mathbin{\!/\mkern-5mu/\!}} = {\mathbf{\uptheta}}{\mathbf{\uptheta}}^\top {\mathbf{y}} / \| {\mathbf{\uptheta}} \|^2$ is the projection of ${\mathbf{y}}$ in the direction of ${\mathbf{\uptheta}}$, and ${\mathbf{y}}_{\bot} =

Figures (20)

  • Figure 1: Overview: assessment of embeddings generated by neighbor embedding methods, illustrated with image data. a We use a standard pre-trained convolutional neural network (CNN) to obtain features of image samples from the CIFAR10 dataset, and then visualize the features using a neighbor embedding method, specifically t-SNE. b Basic ideas of singularity scores and perturbation scores. c t-SNE tends to embed image features into separated clusters even for images with ambiguous semantic meanings (as quantified by higher entropies of predicted class probabilities by the CNN). Perturbation scores identify the embedding points that have ambiguous class membership but less visual uncertainty. d An incorrect choice of perplexity leads to visual fractures (FI discontinuity), which is more severe with a smaller perplexity. We recommend choosing the perplexity no smaller than the elbow point.
  • Figure 2: Diagrams showing the idea of leave-one-out (LOO) and LOO-map. a Idea of LOO. Adding one input point does not significantly change the overall positions of embedding points. The assumption allows us to analyze the properties of the embedding map over the entire input space via an approximated loss which we call LOO loss. b We introduce a global embedding map (LOO-map) ${\mathbf{f}}({\mathbf{x}}) = {\rm argmin}_{{\mathbf{y}}}L({\mathbf{y}};{\mathbf{x}})$ defined in the entire input space as an approximation to the neighbor embedding method $\mathcal{A}$.
  • Figure 3: LOO loss landscape reveals the origins of two distortion patterns. a We illustrate two discontinuity patterns on simulated Gaussian mixture data. OI discontinuity: t-SNE embeds points into well-separated clusters and creates visual overconfidence. FI discontinuity: t-SNE with an inappropriate perplexity creates many artificial fractures. b Origin of OI discontinuity: LOO loss contour plot shows distantly separated minima. We add a new input point ${\mathbf{x}}$ at one of the $4$ interpolated locations ${\mathbf{x}} = t{\mathbf{c}}_1 + (1-t){\mathbf{c}}_2$ where $t\in\{0,0.47,0.48,1\}$ and then visualize the landscape of the LOO loss $L({\mathbf{y}}; {\mathbf{x}})$ using contour plots in the space of ${\mathbf{y}}$. The middle two plots exhibit two well-separated minima (orange triangle), which cause a huge jump of the embedding point (as the minimizer of the LOO loss) under a small perturbation of ${\mathbf{x}}$. c Origin of FI discontinuity: We show LOO loss contour plots with interpolation coefficient $t\in\{0.2,0.4,0.6,0.8\}$. The plots show many local minima and irregular jumps. Under an inappropriate perplexity, the loss landscape is consistently fractured. Numerous local minima cause an uneven trajectory of embedding points (dashed line) when adding ${\mathbf{x}}$ at evenly interpolated locations.
  • Figure 4: Simulation studies demonstrate the effectiveness of proposed scores. a Perturbation scores identify unreliable embedding points that have reduced uncertainty. Input points from 5-component Gaussian mixture data form separated clusters in the embedding space. t-SNE reduces perceived uncertainty for input points in the overlapping region (left), as captured by the label-dependent measurements, namely the entropy difference (middle). Our perturbation scores can identify the same unreliable embedding points without label information (right). b-c Singularity scores reveal spurious sub-clusters on Gaussian mixture data (b) and Swiss roll data (c). At a low perplexity, t-SNE creates many spurious sub-clusters. Embedding points receiving high singularity scores at random locations are an indication of such spurious structures.
  • Figure 5: Perturbation scores detect out-of-distribution (OOD) image data. a We use a pretrained ResNet-18 model to extract features of CIFAR-10 images and, as out-of-distribution data, of DTD texture images. Then we visualize the features using t-SNE with perplexity 100. A fraction of OOD embedding points are absorbed into clusters that represent CIFAR-10 image categories such as deer, truck, and automobile. b-d Perturbation scores can effectively identify misplaced out-of-distribution data points. The ROC curves show the proportion of OOD points correctly identified by the perturbation scores.
  • ...and 15 more figures
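
The LOO-map of Figure 2, ${\mathbf{f}}({\mathbf{x}}) = {\rm argmin}_{{\mathbf{y}}}L({\mathbf{y}};{\mathbf{x}})$, can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the loss is a simplified stand-in (a cross-entropy between Gaussian input affinities and Student-t embedding similarities, normalized over the new point's pairs only) rather than the paper's LOO loss, a fixed bandwidth `sigma` replaces perplexity calibration, the existing embedding `Y` is hand-made rather than produced by t-SNE, and plain gradient descent from one starting point finds only a local minimizer. The function names `affinities`, `loo_loss_and_grad`, and `loo_map` are hypothetical.

```python
import math
import random


def affinities(x, X, sigma=1.0):
    """Gaussian affinities from a new input x to each existing input (sum to 1).

    Assumption: a fixed bandwidth sigma stands in for perplexity calibration.
    """
    d2 = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
    d2_min = min(d2)  # shift exponents for numerical stability
    e = [math.exp(-(d - d2_min) / (2.0 * sigma ** 2)) for d in d2]
    total = sum(e)
    return [v / total for v in e]


def loo_loss_and_grad(y, Y, p):
    """Stand-in LOO loss: cross-entropy between p and Student-t similarities q(y)."""
    diffs = [(y[0] - yi[0], y[1] - yi[1]) for yi in Y]
    w = [1.0 / (1.0 + dx * dx + dy * dy) for dx, dy in diffs]
    s = sum(w)
    loss = -sum(pi * math.log(wi / s) for pi, wi in zip(p, w))
    # Gradient with respect to y: 2 * sum_i (p_i - q_i) * w_i * (y - y_i).
    gx = sum(2.0 * (pi - wi / s) * wi * d[0] for pi, wi, d in zip(p, w, diffs))
    gy = sum(2.0 * (pi - wi / s) * wi * d[1] for pi, wi, d in zip(p, w, diffs))
    return loss, (gx, gy)


def loo_map(x, X, Y, sigma=1.0, lr=0.1, steps=2000):
    """Approximate f(x) = argmin_y L(y; x) with all existing embedding points frozen."""
    p = affinities(x, X, sigma)
    # Start from the affinity-weighted mean of the existing embedding points.
    y = [sum(pi * yi[k] for pi, yi in zip(p, Y)) for k in range(2)]
    for _ in range(steps):
        _, (gx, gy) = loo_loss_and_grad(y, Y, p)
        y = [y[0] - lr * gx, y[1] - lr * gy]
    return y


# Toy data: two Gaussian clusters in 5 dimensions and a hand-made 2-D embedding
# that places the first cluster near (-10, 0) and the second near (10, 0).
rng = random.Random(0)
c1 = [0.0] * 5
c2 = [6.0, 0.0, 0.0, 0.0, 0.0]
X = [[c + rng.gauss(0.0, 0.5) for c in center] for center in (c1, c2) for _ in range(20)]
Y = [[m + rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for m in (-10.0, 10.0) for _ in range(20)]

# Embed a new input placed at each cluster center without moving any existing point.
y_at_c1 = loo_map(c1, X, Y)
y_at_c2 = loo_map(c2, X, Y)
print("f(c1) =", y_at_c1)
print("f(c2) =", y_at_c2)
```

Because the existing points stay frozen, each evaluation of `loo_map` is a two-dimensional optimization, so the map can be probed at arbitrary inputs, for example along the interpolation ${\mathbf{x}} = t{\mathbf{c}}_1 + (1-t){\mathbf{c}}_2$ used in Figure 3. Reproducing the discontinuities shown there would additionally require the paper's actual loss and a search over multiple starting points.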

Theorems & Definitions (1)

  • Theorem 1