From $t$-SNE to UMAP with contrastive learning

Sebastian Damrich; Jan Niklas Böhm; Fred A. Hamprecht; Dmitry Kobak

TL;DR

This work establishes a conceptual bridge between two dominant visualization methods, $t$-SNE and UMAP, by recasting them as contrastive neighbor embeddings (NE) optimized with noise-contrastive estimation (NCE) and negative sampling (NEG), respectively. By generalizing NCE to use a fixed normalization constant $\bar{Z}$, the authors show that NEG fits the data distribution only up to a proportionality factor, explain why UMAP yields more compact clusters, and uncover a spectrum of embeddings that interpolates between local and global structure. The study also connects NE to self-supervised learning, presenting InfoNC-$t$-SNE and parametric NE variants, and demonstrates practical viability with a PyTorch framework and empirical results across datasets. Overall, the paper provides a unifying theory for NE methods, clarifies the relationship between $t$-SNE and UMAP, and offers a controllable spectrum that mitigates over-interpretation of visualizations and adapts embeddings to different structural emphases.

Abstract

Neighbor embedding methods $t$-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. Motivated from entirely different viewpoints, their loss functions appear to be unrelated. In practice, they yield strongly differing embeddings and can suggest conflicting interpretations of the same data. The fundamental reasons for this and, more generally, the exact relationship between $t$-SNE and UMAP have remained unclear. In this work, we uncover their conceptual connection via a new insight into contrastive learning methods. Noise-contrastive estimation can be used to optimize $t$-SNE, while UMAP relies on negative sampling, another contrastive method. We find the precise relationship between these two contrastive methods and provide a mathematical characterization of the distortion introduced by negative sampling. Visually, this distortion results in UMAP generating more compact embeddings with tighter clusters compared to $t$-SNE. We exploit this new conceptual connection to propose and implement a generalization of negative sampling, allowing us to interpolate between (and even extrapolate beyond) $t$-SNE and UMAP and their respective embeddings. Moving along this spectrum of embeddings leads to a trade-off between discrete / local and continuous / global structures, mitigating the risk of over-interpreting ostensible features of any single embedding. We provide a PyTorch implementation.
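
To make the idea of a generalized negative-sampling loss more concrete, here is a minimal sketch in plain Python (not the authors' PyTorch implementation). It assumes a Cauchy kernel $q_{ij} = 1/(1+d_{ij}^2)$, uniform noise over the $\binom{n}{2}$ point pairs, and $m$ noise pairs per neighbor pair. The loss is the standard NCE classification loss with the model's normalization held fixed at $\bar{Z}$, so the paper's exact formulation may differ. The names `cauchy` and `neg_loss` are illustrative only.

```python
import math


def cauchy(d2):
    """Cauchy similarity kernel for a squared embedding distance d2."""
    return 1.0 / (1.0 + d2)


def neg_loss(pos_d2, neg_d2, z_bar, n, m):
    """Negative-sampling-style loss with a fixed normalization constant.

    pos_d2: squared embedding distances of neighbor (positive) pairs
    neg_d2: squared embedding distances of noise pairs (m per positive pair)
    z_bar:  fixed normalization constant (assumed hyperparameter)
    n:      number of points, so that |X| = C(n, 2) pairs exist
    m:      number of noise pairs per positive pair
    """
    n_pairs = n * (n - 1) // 2      # |X| = C(n, 2)
    c = z_bar * m / n_pairs         # equals 1 when z_bar = |X| / m
    # Attraction: neighbor pairs should be classified as data.
    attract = -sum(math.log(cauchy(d) / (cauchy(d) + c)) for d in pos_d2)
    # Repulsion: noise pairs should be classified as noise.
    repel = -sum(math.log(c / (cauchy(d) + c)) for d in neg_d2)
    # Average over positive pairs; each contributes m noise terms.
    return (attract + repel) / len(pos_d2)
```

With `z_bar = n_pairs / m` the constant `c` equals 1, which is the setting the Figure 1 caption associates with UMAP; other values of `z_bar` would move along the spectrum described in the abstract.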

Paper Structure (39 sections, 3 theorems, 58 equations, 22 figures, 5 tables, 1 algorithm)

Introduction
Related work
Background
Noise-contrastive estimation (NCE)
Neighbor embeddings
From noise-contrastive estimation to negative sampling
Negative sampling spectrum
UMAP's conceptual relation to t-SNE
Contrastive NE and contrastive self-supervised learning
Discussion and Conclusion
Probabilistic frameworks of NCE and InfoNCE
NCE
InfoNCE
Gradients
Gradients of MLE, NCE, and NEG
...and 24 more sections

Key Result

Theorem 1

Let $\xi$ have full support and suppose there exists some $\theta^*$ such that $q_{\theta^*}= p$. Then $\theta^*$ is a minimum of $\mathcal{L}^{\mathrm{NCE}}$, and the only other extrema of $\mathcal{L}^{\mathrm{NCE}}$ are minima $\tilde{\theta}$ which also satisfy $q_{\tilde{\theta}} = p$.
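
For orientation, the NCE loss named in the theorem is commonly written as follows. This is the standard form from the NCE literature, with noise distribution $\xi$ and $m$ noise samples per data sample; the paper's own notation may differ in details.

$$
\mathcal{L}^{\mathrm{NCE}}(\theta) = -\mathbb{E}_{x \sim p} \log\left(\frac{q_\theta(x)}{q_\theta(x) + m\,\xi(x)}\right) - m\,\mathbb{E}_{y \sim \xi} \log\left(1 - \frac{q_\theta(y)}{q_\theta(y) + m\,\xi(y)}\right)
$$

The first term rewards the model for assigning high probability to data samples, and the second for assigning low probability to noise samples. The full-support condition on $\xi$ ensures that the ratio is well defined wherever $p$ has mass.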

Figures (22)

Figure 1: (a--e) Neg-$t$-SNE embedding spectrum of the MNIST dataset for various values of the fixed normalization constant $\bar{Z}$, see Sec. \ref{sec:spectrum}. As $\bar{Z}$ increases, the scale of the embedding decreases, and clusters become more compact and separated before eventually starting to merge. The Neg-$t$-SNE spectrum produces embeddings very similar to those of (f) $t$-SNE, (g) NCVis, and (h) UMAP when $\bar{Z}$ equals the partition function of $t$-SNE, the learned normalization parameter $Z$ of NCVis, or $|X| / m = \binom{n}{2}/m$ used by UMAP, respectively, as predicted in Sec. \ref{sec:NCEtoNEG}--\ref{sec:UMAPtotSNE}. (i) The partition function $\sum_{ij}(1+d_{ij}^2)^{-1}$ tries to match $\bar{Z}$ and grows with it. Here, we initialized all Neg-$t$-SNE runs using $\bar{Z} = |X|/m$; without this 'early exaggeration', low values of $\bar{Z}$ yield fragmented clusters (Fig. \ref{fig:negtsne_spectrum_pca}).
Figure 2: Embeddings of the MNIST dataset with UMAP and Neg-$t$-SNE with and without learning rate annealing in our implementation. UMAP does not work well without annealing because it implicitly uses the diverging $1/d_{ij}^2$ kernel in NEG, while Neg-$t$-SNE uses the more numerically stable Cauchy kernel (Sec. \ref{sec:UMAPtotSNE}). UMAP's reference implementation also requires annealing, see Figs. \ref{subfig:umap_eps_10e-3_no_anneal}, \ref{subfig:umap_eps_10e-3_anneal}.
Figure 3: NE embeddings of MNIST are qualitatively similar in the non-parametric (top) and parametric settings (bottom). We used our PyTorch framework with $m=5$ and batch size $b=1024$.
Figure S1: UMAP embeddings of the MNIST dataset, ablating numerical optimization tricks of UMAP's reference implementation. The learning rate annealing is crucial (bottom row), but safeguarding against divisions by zero in UMAP's repulsive term, Eq. (\ref{eq:umap_rep_grad_eps}), by adding $\zeta$ to the denominator has little effect. These experiments were run using the reference implementation, modified to change the $\zeta$ value and to optionally switch off the learning rate annealing.
Figure S2: UMAP and Neg-$t$-SNE embeddings of the MNIST dataset using different values $\varepsilon$ at which we clip arguments to logarithm functions. These experiments were done using our implementation. Varying $\varepsilon$ did not strongly influence the appearance of the embedding, but setting $\varepsilon = 0$ caused UMAP runs to crash. Annealing the learning rate is important for UMAP, but not for Neg-$t$-SNE.
...and 17 more figures
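
The two quantities compared in the Figure 1 caption, the partition function $\sum_{ij}(1+d_{ij}^2)^{-1}$ and UMAP's fixed constant $\binom{n}{2}/m$, can be computed as in the sketch below. It assumes the sum runs over unordered pairs (matching $|X| = \binom{n}{2}$) and uses plain Python for clarity; the function names are hypothetical and not part of the authors' framework.

```python
from itertools import combinations
from math import comb


def partition_function(points):
    """Sum of Cauchy similarities 1 / (1 + d_ij^2) over unordered point pairs."""
    total = 0.0
    for a, b in combinations(points, 2):
        d2 = sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean distance
        total += 1.0 / (1.0 + d2)
    return total


def umap_z_bar(n, m):
    """Fixed normalization |X| / m = C(n, 2) / m from the Figure 1 caption."""
    return comb(n, 2) / m


# Toy 2D embedding of three points (illustrative data only).
embedding = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
z = partition_function(embedding)
z_bar = umap_z_bar(n=len(embedding), m=1)
```

Under this definition the partition function approaches $\binom{n}{2}$ as all points collapse onto one another and shrinks as the embedding spreads out, which is consistent with the caption's observation that a larger $\bar{Z}$ goes along with a smaller embedding scale.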

Theorems & Definitions (3)

Theorem 1 (Gutmann & Hyvärinen, 2010, 2012)
Corollary 2
Lemma 3
