Unsupervised visualization of image datasets using contrastive learning

Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

TL;DR

t-SimCNE addresses the failure of pixel-space visualizations of image data by combining contrastive learning with a neighbor-embedding objective to train a parametric 2D embedding. It replaces the high-dimensional projection output with a 2D output and measures similarity with a Cauchy kernel on Euclidean distances, yielding visualizations that preserve semantic structure and support out-of-sample points. On CIFAR-10 and CIFAR-100, the 2D embeddings reach classification accuracy comparable to high-dimensional SimCLR representations and reveal subclass structure, artifacts, and inter-class relations, which makes the method useful for exploratory data analysis and quality control. A dimensionality-annealing training strategy (pretraining with a 128D output, then fine-tuning to 2D) markedly improves embedding quality and cluster separation, suggesting paths for applying t-SimCNE to larger datasets and scientific domains.
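
As a rough illustration of the "Euclidean distance paired with a Cauchy kernel" idea, the following NumPy sketch computes a SimCLR-style contrastive loss in which the similarity between two embeddings is $1/(1+\lVert \mathbf z_i - \mathbf z_j\rVert^2)$. The function name `cauchy_contrastive_loss`, the batch layout, and the choice to normalize over all other views in the batch are assumptions made for this example, not details taken from the text above.

```python
import numpy as np

def cauchy_contrastive_loss(z_a, z_b):
    """Contrastive loss with a Cauchy kernel (illustrative sketch).

    z_a, z_b: arrays of shape (b, d); row i of each holds the embedding
    of one of the two augmented views of image i (hypothetical layout).
    """
    z = np.concatenate([z_a, z_b], axis=0)      # all 2b views, shape (2b, d)
    n = z.shape[0]
    b = n // 2

    # Pairwise squared Euclidean distances between all views.
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)

    # Cauchy similarity; heavy-tailed, as in t-SNE's low-dimensional kernel.
    sim = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(sim, 0.0)                  # a view is never its own neighbor

    # The positive partner of view i is the other view of the same image.
    pos_idx = (np.arange(n) + b) % n
    pos = sim[np.arange(n), pos_idx]

    # Assumed normalization: sum over all other views in the batch.
    return float(np.mean(-np.log(pos / sim.sum(axis=1))))


# Toy usage: three well-separated images whose two views coincide.
views = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
print(cauchy_contrastive_loss(views, views.copy()))
```

Because the kernel depends only on Euclidean distance, the same function works for a 128D output and for a 2D output, which is what allows the output dimensionality to be reduced during training.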

Abstract

Visualization methods based on the nearest neighbor graph, such as t-SNE or UMAP, are widely used for visualizing high-dimensional data. Yet, these approaches only produce meaningful results if the nearest neighbors themselves are meaningful. For images represented in pixel space this is not the case: distances in pixel space often do not capture our sense of similarity, and therefore neighbors are not semantically close. This problem can be circumvented by self-supervised approaches based on contrastive learning, such as SimCLR, which rely on data augmentation to generate implicit neighbors, but these methods do not produce two-dimensional embeddings suitable for visualization. Here, we present a new method, called t-SimCNE, for unsupervised visualization of image data. t-SimCNE combines ideas from contrastive learning and neighbor embeddings, and trains a parametric mapping from the high-dimensional pixel space into two dimensions. We show that the resulting 2D embeddings achieve classification accuracy comparable to the state-of-the-art high-dimensional SimCLR representations, thus faithfully capturing semantic relationships. Using t-SimCNE, we obtain informative visualizations of the CIFAR-10 and CIFAR-100 datasets, showing rich cluster structure and highlighting artifacts and outliers.

Paper Structure (13 sections, 4 equations, 22 figures, 2 tables)

Figures (22)

  • Figure 1: Left: $t$-SimCNE. Two augmentations of the same image are fed through the same ResNet and fully-connected projection head to get representations $\mathbf z_i$ and $\mathbf z_j$. The loss function pushes $\mathbf z_i$ and $\mathbf z_j$ together to maximize their Cauchy similarity. Middle: Embedding of CIFAR-10. The dashed arrows point to the locations of $\mathbf z_i$ and $\mathbf z_j$ from the left. Right: Training loss. The optimization consists of three stages: (1) pre-training with a 128D output for 1000 epochs; (2) fine-tuning only the 2D readout layer for 50 epochs; and (3) fine-tuning the entire network for 450 epochs.
  • Figure 2: Different training strategies for $t$-SimCNE on CIFAR-10. (a) Optimizing the 2D Euclidean loss for 1000 epochs. (b) Optimizing the 2D Euclidean loss for 5000 epochs. (c) Pretraining with cosine loss in 128D and fine-tuning with Euclidean loss in 2D. (d) Pretraining with Euclidean loss in 128D and fine-tuning with Euclidean loss in 2D.
  • Figure 4: $L_2$ norms of the representation in 128-dimensional $Z$ space after training on CIFAR-10 with the Euclidean loss (a) and with the cosine loss (b). Standard SimCLR uses the cosine loss. Colors as in the figure labeled fig:cifar10.gallery. There was a similar difference in the $H$ space, but less pronounced.
  • Figure 5: Annotated $t$-SimCNE embedding of the CIFAR-10 dataset. We manually annotated some of the prominent clusters by inspecting the images. Shown images are a random selection from the 15 nearest neighbors of the line tip.
  • Figure 6: Annotated $t$-SimCNE embedding of the CIFAR-100 dataset. Class labels were positioned on the periphery in the order of $\mathrm{atan2}(y,x)$ where $(x,y)$ is the mode of the kernel density estimate of embedding coordinates within each class.
  • ...and 17 more figures
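
The label-placement rule in the Figure 6 caption (order classes by $\mathrm{atan2}(y,x)$ of each class's density mode) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the helper names `kde_mode` and `order_classes_by_angle` are hypothetical, the mode is approximated by the sample point with the highest Gaussian kernel density, the bandwidth is an arbitrary fixed value, and angles are measured around the origin of the embedding as given. The caption does not specify any of these choices.

```python
import numpy as np

def kde_mode(points, bandwidth=1.0):
    """Approximate the mode of a Gaussian KDE (assumed estimator).

    Evaluates the density only at the sample points themselves and
    returns the sample with the highest estimated density.
    """
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    density = np.exp(-sq / (2.0 * bandwidth ** 2)).sum(axis=1)
    return points[np.argmax(density)]

def order_classes_by_angle(embedding, labels, bandwidth=1.0):
    """Return class ids sorted by atan2(y, x) of each class's KDE mode."""
    classes = np.unique(labels)
    angles = []
    for c in classes:
        x, y = kde_mode(embedding[labels == c], bandwidth)
        angles.append(np.arctan2(y, x))     # angle in (-pi, pi]
    return classes[np.argsort(angles)]


# Toy usage: four classes placed right, top, left, and bottom of the origin.
emb = np.array([[1, 0], [1, 0], [0, 1], [0, 1],
                [-1, 0.1], [-1, 0.1], [0, -1], [0, -1]], dtype=float)
lab = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(order_classes_by_angle(emb, lab))
```

Sorting by angle places each label on the periphery near the direction of its class, which keeps the leader lines from labels to clusters short and reduces crossings.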