t-SNE Exaggerates Clusters, Provably

Noah Bergam, Szymon Snoeck, Nakul Verma

TL;DR

This work proves that t-SNE visualizations can misrepresent both the cluster structure of the input and the extremity of its outliers. It shows that (i) similarly clustered outputs can come from arbitrarily unclustered inputs and (ii) tiny input perturbations can yield vastly different visualizations. It introduces impostor datasets that share t-SNE embeddings with the original data by exploiting t-SNE's invariance to additive and multiplicative changes in interpoint distances, and it proves a fundamental limit: stationary t-SNE outputs cannot depict extreme outliers beyond a small constant. The results are supported by formal theorems and by empirical demonstrations on real and synthetic data (e.g., single-cell and BBC news embeddings), which reveal substantial risks of false positives and misinterpretation. Overall, the paper establishes principled limits on what can be inferred from t-SNE plots and motivates the search for more reliable visualization guarantees and tools for exploratory data analysis.
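The invariances mentioned above can be checked numerically. The following is a minimal sketch, not the paper's construction: it assumes the standard t-SNE input affinities (Gaussian kernels on squared distances, calibrated per point to a target perplexity), and the toy points, the shift of 100, the scale factor of 7, and all function names are illustrative choices. Under those assumptions, adding a constant to every off-diagonal squared distance, or multiplying all of them by a positive factor, should leave the affinities unchanged, even though the shifted distances are far closer to equidistant.

```python
import math

def squared_distances(points):
    """Matrix of squared Euclidean distances between all pairs of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]

def conditional_affinities(sq_dists, perplexity, tol=1e-10, max_iter=200):
    """Row-normalized Gaussian affinities P[i][j], one bandwidth per row.

    Each row's precision beta = 1 / (2 * sigma^2) is found by bisection so
    that the row's entropy equals log(perplexity).
    """
    n = len(sq_dists)
    target = math.log(perplexity)
    P = []
    for i in range(n):
        row = [sq_dists[i][j] for j in range(n) if j != i]
        # Subtracting the row minimum is a numerical-stability trick; it is
        # harmless precisely because of the additive invariance shown here.
        base = min(row)
        beta, lo, hi = 1.0, 0.0, None
        for _ in range(max_iter):
            w = [math.exp(-beta * (d - base)) for d in row]
            z = sum(w)
            p = [x / z for x in w]
            entropy = -sum(x * math.log(x) for x in p if x > 0.0)
            if abs(entropy - target) < tol:
                break
            if entropy > target:   # row too flat: sharpen the kernel
                lo = beta
                beta = beta * 2.0 if hi is None else (beta + hi) / 2.0
            else:                  # row too peaked: widen the kernel
                hi = beta
                beta = (lo + beta) / 2.0
        p.insert(i, 0.0)           # a point has no affinity to itself
        P.append(p)
    return P

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def distance_spread(sq_dists):
    """Ratio of the largest to the smallest off-diagonal distance."""
    n = len(sq_dists)
    d = [math.sqrt(sq_dists[i][j]) for i in range(n) for j in range(n) if i != j]
    return max(d) / min(d)

# Hypothetical toy input: two loose groups of three points each.
points = [(0.0, 0.0), (1.0, 0.2), (0.3, 1.1), (5.0, 5.0), (6.1, 5.3), (5.2, 6.4)]
n = len(points)
D2 = squared_distances(points)

# Additive change: realizable by giving every point its own extra orthogonal
# coordinate of length sqrt(100 / 2), so this is still a Euclidean dataset.
D2_shift = [[D2[i][j] + 100.0 if i != j else 0.0 for j in range(n)] for i in range(n)]
# Multiplicative change: equivalent to rescaling the whole dataset.
D2_scale = [[7.0 * D2[i][j] for j in range(n)] for i in range(n)]

P = conditional_affinities(D2, perplexity=2.0)
diff_shift = max_abs_diff(P, conditional_affinities(D2_shift, perplexity=2.0))
diff_scale = max_abs_diff(P, conditional_affinities(D2_scale, perplexity=2.0))

print("max affinity change after shift:", diff_shift)
print("max affinity change after scale:", diff_scale)
print("distance spread before / after shift:", distance_spread(D2), distance_spread(D2_shift))
```

Because the t-SNE objective depends on the input only through these affinities, two inputs with matching affinities are indistinguishable to the optimizer. The shifted input here has a much smaller spread between its largest and smallest distances, which is one way an input can look far less clustered while producing the same plot.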

Abstract

Central to the widespread use of t-distributed stochastic neighbor embedding (t-SNE) is the conviction that it produces visualizations whose structure roughly matches that of the input. To the contrary, we prove that (1) the strength of the input clustering, and (2) the extremity of outlier points, cannot be reliably inferred from the t-SNE output. We demonstrate the prevalence of these failure modes in practice as well.

Paper Structure

This paper contains 16 sections, 19 theorems, 58 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Theorem 3

Fix any $n > k > 1$ and any $n$-point dataset $X \subset \mathbb{R}^{n-1}$ with partition $C_1 \sqcup \cdots \sqcup C_k = [n]$ such that $|C_{m\in[k]}| > 1$ and $\bar{\mathcal{S}}(X; C_{m\in[k]})$ is well defined. For all $0 < \epsilon \leq 1$, there exists an $n$-point dataset $X_\epsilon \subset \mathbb{R}^{\ldots}$ […] yet, for any $\rho \in (1, n-1)$: […]

Figures (9)

  • Figure 1: Visualizations of single-cell data (top row) versus an arbitrarily unclustered impostor dataset (bottom row). Based on the 2D t-SNE visualization (left column), it is difficult to distinguish which dataset (real or impostor) may have produced the plot. Plotting the high-dimensional interpoint distances (right column) confirms that the impostor dataset is unclustered in some sense. As a reference, we also plot the 2D PCA visualization (center column) to indicate that this issue does not occur with other methods. The numbers on the bottom left of each panel show the cluster salience in terms of the average silhouette score for the 2D t-SNE plot (left), 2D PCA plot (center), and high-dimensional input (right) for the real dataset (top) and the impostor dataset (bottom). Note that the color coding in all of the scatter plots corresponds to a DBSCAN clustering (Ester et al., 1996) of the top left t-SNE plot.
  • Figure 2: Myriad 2D t-SNE visualizations, all produced by small perturbations of the same 200-point input dataset. Each perturbation satisfies the conditions of Theorem \ref{thm:perturbhammer} for $\epsilon = 0.01$.
  • Figure 3: t-SNE versus PCA plots in response to the injection of a single "poison" point in the input dataset. The original dataset, visualized in panels 1 and 3, consists of $400$ points sampled from a mixture of two well-separated Gaussians in $\mathbb{R}^{2000}$. The poison point is then placed at the mean of the previously sampled points; the resulting $401$-point dataset is visualized in panels 2 and 4.
  • Figure 4: t-SNE's versus PCA's response to $\alpha$-outliers. Top row: on a dataset that tracks financial activity, around $1\%$ of which is fraudulent, t-SNE fails while PCA largely succeeds at separating fraudulent (red) from non-fraudulent (black) points. Note that each of the fraudulent data points is an $(\alpha > 0)$-outlier with respect to the non-fraudulent group; the top right panel shows how t-SNE and PCA register those $\alpha$-values in their output. Middle row: a similar analysis on a synthetic dataset composed of a Gaussian sample plus a single $\alpha$-outlier, with varying values of $\alpha$. Bottom row: a mixture of two Gaussians plus 1, 10, and 100 $\alpha$-outliers. Despite a large gap ($\alpha > 1$) between the outliers and the two clusters, t-SNE is unable to separate them.
  • Figure 5: t-SNE's response to the injection of poison points (middle) and $\alpha$-outliers (right) on the BBC News Article dataset. Middle: injecting poison points (red) into the original dataset (black) significantly disrupts the underlying cluster structure. Right: while injecting $(\alpha > 1)$-outliers (red) does not disrupt the underlying cluster structure (black), the extreme outliers themselves are not well separated. The bottom left label in each plot denotes the silhouette score of the t-SNE-projected original points (without the injected points) with respect to the true labels (business, entertainment, politics, sport, tech).
  • ...and 4 more figures
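
The captions for Figures 1 and 5 report cluster salience as an average silhouette score. As a rough illustration, the sketch below computes the standard average silhouette score with Euclidean distances. The paper's exact definition of $\bar{\mathcal{S}}$ may differ in details, and the function name and toy data are hypothetical. The requirement that every cluster has more than one point mirrors the condition $|C_m| > 1$ in Theorem 3.

```python
import math

def average_silhouette(points, labels):
    """Mean silhouette over all points, using Euclidean distance.

    For point i, a is its mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster; the
    silhouette of i is (b - a) / max(a, b).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2 or any(len(m) < 2 for m in clusters.values()):
        raise ValueError("need at least two clusters, each with more than one point")

    total = 0.0
    for i, own in enumerate(labels):
        a = sum(math.dist(points[i], points[j]) for j in clusters[own] if j != i)
        a /= len(clusters[own]) - 1
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != own
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0.0 else 0.0
    return total / len(points)

# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
matching_labels = [0, 0, 0, 1, 1, 1]      # labels that follow the groups
interleaved_labels = [0, 1, 0, 1, 0, 1]   # labels that ignore the groups

score_matching = average_silhouette(toy_points, matching_labels)
score_interleaved = average_silhouette(toy_points, interleaved_labels)
print("matching labels:", score_matching)
print("interleaved labels:", score_interleaved)
```

Scores near 1 indicate tight, well-separated clusters, while scores near or below 0 indicate little cluster structure with respect to the given labels. Comparing the score of a 2D plot against the score of the high-dimensional input, as Figure 1 does, is what exposes a plot that looks more clustered than its input.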

Theorems & Definitions (37)

  • Definition 1
  • Definition 2
  • Theorem 3
  • Corollary 3
  • Theorem 4
  • Lemma 5
  • Definition 6
  • Theorem 7
  • Theorem 8
  • Corollary 9
  • ...and 27 more