Understanding Contrastive Learning through Variational Analysis and Neural Network Optimization Perspectives

Jeff Calder, Wonjun Lee

TL;DR

This work analyzes SimCLR through a variational lens and neural-network optimization dynamics, revealing that minimizing the NT-Xent loss alone can yield invariant minimizers independent of the data distribution, while the training dynamics of neural networks inject information about the data geometry into the latent space. By formulating a generalized loss with a neural-kernel perspective and studying a one-hidden-layer network, the authors show how cluster structure can persist during training and how gradient flow, especially under infinite-width limits, can emphasize or suppress contributions from different clusters. The results connect theoretical optimality conditions with practical training dynamics, explaining why contrastive learning often yields meaningful embeddings despite potential ill-posedness of the objective. These insights offer a principled view of when and how contrastive methods uncover latent structure and suggest directions for rigorously analyzing training dynamics in mean-field and infinite-width regimes.

Abstract

The SimCLR method for contrastive learning of invariant visual representations has become widely used in supervised, semi-supervised, and unsupervised settings, due to its ability to uncover patterns and structures in image data that are not directly present in the pixel representations. However, the reason for this success is not well explained, since it is not guaranteed by invariance alone. In this paper, we conduct a mathematical analysis of the SimCLR method with the goal of better understanding the geometric properties of the learned latent distribution. Our findings reveal two things: (1) the SimCLR loss alone is not sufficient to select a good minimizer -- there are minimizers that give trivial latent distributions, even when the original data is highly clustered -- and (2) in order to understand the success of contrastive learning methods like SimCLR, it is necessary to analyze the neural network training dynamics induced by minimizing a contrastive learning loss. Our preliminary analysis for a one-hidden-layer neural network shows that clustering structure can present itself for a substantial period of time during training, even if the network eventually converges to a trivial minimizer. To substantiate our theoretical insights, we present numerical results that confirm our theoretical predictions.
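To make the object under analysis concrete, the following is a minimal sketch of the standard batch form of the NT-Xent loss used by SimCLR, written in plain Python. The function names `cosine_sim` and `nt_xent`, the default temperature `tau=0.5`, and the batch layout are illustrative assumptions, not code from the paper. The loss takes only latent vectors as input, so any two feature maps that produce the same embeddings receive the same loss value, whatever the geometry of the original data.

```python
import math

def cosine_sim(x, y):
    """Cosine similarity sim(x, y) = (x . y) / (|x| |y|); inputs must be nonzero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def nt_xent(z1, z2, tau=0.5):
    """Mean NT-Xent loss over a batch of positive pairs (z1[i], z2[i]).

    z1 and z2 are equal-length lists of embedding vectors; tau is the
    temperature. This is a hypothetical helper for illustration only.
    """
    n = len(z1)
    z = list(z1) + list(z2)            # 2n embeddings in total
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)          # index of the positive partner of i
        # Scaled similarities to every embedding except i itself.
        logits = [cosine_sim(z[i], z[k]) / tau for k in range(2 * n) if k != i]
        positive = cosine_sim(z[i], z[j]) / tau
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - positive
    return total / (2 * n)

# An invariant map sends both augmented views of a sample to the same point.
a, b = (1.0, 0.0), (-1.0, 0.0)
aligned = nt_xent([a, b], [a, b])      # positives coincide
swapped = nt_xent([a, b], [b, a])      # positives are antipodal
print(aligned < swapped)
```

The comparison at the end only illustrates that aligned positive pairs are rewarded; it does not say anything about which of the many aligned configurations a trained network will reach, which is the question the paper addresses through training dynamics.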

Paper Structure

This paper contains 18 sections, 11 theorems, 84 equations, 5 figures.

Key Result

Proposition 2.1

Suppose $\mu\in\mathbb{P}(\mathbb{R}^d)$ is absolutely continuous and the embedding map $f:\mathbb{R}^D\rightarrow\mathbb{R}^d$ is invariant under the distribution $\nu$, satisfying eq:inv. Applying a change of variables to eq:cost-discrete gives a reformulation in which $\mathrm{sim}(x,y) = \mathrm{sim}_{\mathrm{Id}}(x,y) = \frac{x \cdot y}{\|x\|\|y\|}$.

Figures (5)

  • Figure 1: t-SNE visualizations of the MNIST and CIFAR-10 data sets. In (a) and (b) the images are represented by the raw pixels, while (c) gives a visualization of the SimCLR embedding. This illustrates how SimCLR is able to uncover clustering structure in data sets.
  • Figure 2: Illustration of an invariant feature map $f : \mathbb{R}^D \to \mathbb{R}^d$ that maps the data distribution $\mu$ to the feature distribution $f_\#\mu$ in the latent space, along with a perturbation function $T : \mathbb{R}^D \to \mathbb{R}^D$. The figure shows that both the original point $x$ and the perturbed point $T(x)$ map to $f(x)$ in the feature space.
  • Figure 3: The figure shows the NT-Xent loss for different embedded distributions $f_\#\mu = \frac{1}{K}\sum^{K}_{i=1} \delta_{x_i}$ with $x_i$ on $\mathbb{S}^1$. The first plot shows the loss decreasing with the number of clusters, then plateauing. The second shows the loss decreasing with the minimum squared distance between cluster points, stopping at a threshold. Both suggest that increasing clusters and decreasing $\tau$ reduce the loss. The third plot shows a linear relationship between $\tau$ and the minimum distance.
  • Figure 4: Input data
  • Figure 6: This experiment compares the optimization processes with and without neural network training in 2D and 3D, with the data distribution depicted in (a) and (l). A 4-layer fully connected neural network produces outcomes consistent with those in \ref{fig:vec-comparison}. Each point's color indicates its cluster. Rows 2 and 5 show optimization with neural network training, which starts from a random embedding and gradually reveals the clustering structure. In contrast, Rows 3 and 6 illustrate optimization using vanilla gradient descent, which converges to a uniformly dispersed arrangement and disregards the input data's clustering structure.
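The trend described in the Figure 3 caption can be reproduced in a small sketch. The code below evaluates a contrastive loss for an invariant embedding whose latent distribution is $\frac{1}{K}\sum_{i=1}^{K}\delta_{x_i}$ with the $x_i$ equally spaced on $\mathbb{S}^1$. The exact loss used in the paper is not reproduced in this summary, so the form here is an assumption: the positive term is $-1/\tau$ (both views map to the same point), and the negative term is the log of the mean of $\exp(\mathrm{sim}/\tau)$ over the $K$ latent points. The name `circle_loss` and the default `tau=0.5` are hypothetical.

```python
import math

def circle_loss(K, tau=0.5):
    """Assumed population-style contrastive loss for K equally spaced points on S^1.

    Assumption (not taken from the paper): the loss is
        mean_i [ log( mean_j exp(sim(x_i, x_j) / tau) ) - 1 / tau ],
    where sim is cosine similarity and each positive pair coincides.
    """
    angles = [2 * math.pi * k / K for k in range(K)]
    total = 0.0
    for a in angles:
        # For unit vectors on the circle, cosine similarity is cos(angle difference).
        mean_exp = sum(math.exp(math.cos(a - b) / tau) for b in angles) / K
        total += math.log(mean_exp) - 1.0 / tau
    return total / K

for K in range(1, 9):
    print(K, round(circle_loss(K), 4))
```

Under this assumed normalization, a single cluster gives a loss of zero, and the loss falls as points are added before leveling off, which matches the qualitative behavior in the first panel of Figure 3. The data distribution never enters the computation, so the sketch also illustrates why the loss alone cannot distinguish embeddings that respect the input clusters from ones that do not.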

Theorems & Definitions (25)

  • Proposition 2.1
  • Definition 2.1
  • Proposition 3.1
  • Theorem 3.2
  • Remark 3.1
  • Remark 3.2
  • Proposition 4.1
  • Remark 4.1
  • Remark 4.2
  • Theorem 4.2
  • ...and 15 more