Contrastive Learning Is Spectral Clustering On Similarity Graph

Zhiquan Tan, Yifan Zhang, Jingqin Yang, Yang Yuan

TL;DR

This work proves that InfoNCE-based contrastive learning (e.g., SimCLR) is equivalent to spectral clustering on the augmentation-derived similarity graph, and extends the theory to multi-modal CLIP, yielding a representation-theoretic view of embedding as spectral clustering on a pair graph. Building on a Markov random field framework and a maximum-entropy argument, the authors introduce Kernel-InfoNCE, using mixtures of exponential kernels to better capture local similarity, and derive practical kernel choices such as Simple Sum and Concatenation Sum. Empirically, Kernel-InfoNCE improves over Gaussian-kernel SimCLR on CIFAR-10/100 and TinyImageNet, and LaCLIP is discussed as a natural extension to enhance cross-modal clustering. The work provides reproducible code and a cohesive theoretical lens for understanding and improving contrastive learning across single- and multi-modal settings.

Abstract

Contrastive learning is a powerful self-supervised learning method, but we have a limited theoretical understanding of how it works and why it works. In this paper, we prove that contrastive learning with the standard InfoNCE loss is equivalent to spectral clustering on the similarity graph. Using this equivalence as the building block, we extend our analysis to the CLIP model and rigorously characterize how similar multi-modal objects are embedded together. Motivated by our theoretical insights, we introduce the Kernel-InfoNCE loss, incorporating mixtures of kernel functions that outperform the standard Gaussian kernel on several vision datasets. The code is available at https://github.com/yifanzhang-pro/Kernel-InfoNCE.
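
To make the role of the Gaussian kernel concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified one-directional InfoNCE in which row i of `z_a` and row i of `z_b` form the positive pair; the function names and the temperature `tau` are illustrative assumptions. For unit-norm embeddings, $\|a-b\|^2 = 2 - 2\,a^\top b$, so a Gaussian kernel $\exp(-\|a-b\|^2/(2\tau))$ differs from $\exp(a^\top b/\tau)$ only by a constant factor that cancels after row normalization.

```python
import numpy as np

def _log_softmax_rows(logits):
    # Numerically stable log-softmax over each row.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def gaussian_kernel_infonce(z_a, z_b, tau=0.5):
    # Similarity is a Gaussian kernel on embeddings: exp(-||a - b||^2 / (2 * tau)).
    sq_dist = ((z_a[:, None, :] - z_b[None, :, :]) ** 2).sum(axis=-1)
    log_prob = _log_softmax_rows(-sq_dist / (2.0 * tau))
    # Positive pairs sit on the diagonal (assumption of this sketch).
    return -np.mean(np.diag(log_prob))

def cosine_infonce(z_a, z_b, tau=0.5):
    # Familiar form: softmax over inner products divided by the temperature.
    log_prob = _log_softmax_rows(z_a @ z_b.T / tau)
    return -np.mean(np.diag(log_prob))

# Synthetic unit-norm embeddings, for illustration only.
rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 16))
z_b = rng.normal(size=(8, 16))
z_a /= np.linalg.norm(z_a, axis=1, keepdims=True)
z_b /= np.linalg.norm(z_b, axis=1, keepdims=True)

loss_gaussian = gaussian_kernel_infonce(z_a, z_b)
loss_cosine = cosine_infonce(z_a, z_b)
```

On normalized embeddings the two losses coincide, which is why the standard temperature-scaled loss can be read as using a Gaussian kernel. Swapping in a different kernel inside `gaussian_kernel_infonce` is where a kernel mixture would enter.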

Paper Structure

This paper contains 24 sections, 6 theorems, 25 equations, 3 figures, 4 tables, 2 algorithms.

Key Result

Lemma 2.3

For $\mathbf{W}\sim \mathbb{P}(\cdot; \boldsymbol{\pi})$ and every $i \in [n]$, $\mathbf{W}_i\sim \mathcal{M}(1, \boldsymbol{\pi}_i/\sum_j \boldsymbol{\pi}_{i,j})$, where $\mathcal{M}$ is the multinomial distribution. Moreover, for any distinct $i,i'\in [n]$, $\mathbf{W}_i$ is independent of $\mathbf{W}_{i'}$.
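
Read operationally, the lemma says a subgraph $\mathbf{W}$ can be drawn one row at a time: each row is a single multinomial trial over the normalized row of $\boldsymbol{\pi}$, independently of the other rows. The sketch below illustrates this under assumptions: `sample_subgraph` and the toy matrix `pi` are hypothetical names and values chosen for the example, not taken from the paper's code.

```python
import numpy as np

def sample_subgraph(pi, rng):
    """Draw W row by row: row i is one multinomial trial with
    probabilities pi[i] / sum_j pi[i, j]; rows are sampled independently."""
    n = pi.shape[0]
    probs = pi / pi.sum(axis=1, keepdims=True)
    W = np.zeros_like(pi, dtype=int)
    for i in range(n):
        j = rng.choice(n, p=probs[i])  # single outgoing edge for node i
        W[i, j] = 1
    return W

# Toy similarity matrix with zero self-similarity (illustrative values).
pi = np.array([
    [0.0, 3.0, 1.0],
    [2.0, 0.0, 2.0],
    [1.0, 1.0, 0.0],
])
rng = np.random.default_rng(0)
W = sample_subgraph(pi, rng)

# Empirical edge frequencies over many draws approach the normalized rows.
draws = np.stack([sample_subgraph(pi, rng) for _ in range(5000)])
freq = draws.mean(axis=0)
```

Every sampled `W` has exactly one outgoing edge per node, consistent with subgraphs whose nodes have out-degree larger than 1 receiving probability 0.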

Figures (3)

  • Figure 1: An illustration of our analysis. The similarity matrix $\boldsymbol{\pi}$ encodes the relationships between images. Because the matrix is large, we use Markov random field sampling to avoid working with it directly. We show that the InfoNCE loss is equivalent to spectral clustering when a Gaussian kernel function is used.
  • Figure 2: Sampling probabilities of the subgraphs defined by $\mathbb{P} (\mathbf{W};\boldsymbol{\pi})$. The first subfigure shows the underlying graph $\boldsymbol{\pi}$; the next three subfigures show three different subgraphs with their sampling probabilities. The last subfigure has sampling probability $0$ because the purple node has out-degree larger than $1$.
  • Figure 3: Visualizations of the optimization process using InfoNCE Loss on the vectors corresponding to $\boldsymbol{\pi}$. Points of identical color belong to the same cluster within $\boldsymbol{\pi}$. To showcase the internal structure of $\boldsymbol{\pi}$, we randomly select 10 vertices from each cluster to display the edge distribution of $\boldsymbol{\pi}$.

Theorems & Definitions (17)

  • Definition 2.1: Reproducing kernel Hilbert space
  • Definition 2.2: Distribution of $\mathbf{W}$
  • Lemma 2.3
  • Lemma 2.4
  • Lemma 2.5
  • Definition 2.6: Graph Laplacian operator
  • Definition 2.7: Spectral Clustering
  • Theorem 3.1
  • proof
  • Definition 4.1: Pair graph
  • ...and 7 more