$f$-MICL: Understanding and Generalizing InfoNCE-based Contrastive Learning

Yiwei Lu, Guojun Zhang, Sun Sun, Hongyu Guo, Yaoliang Yu

TL;DR

The paper introduces $f$-MICL, which generalizes InfoNCE from KL-based mutual information to $f$-mutual information over a broad class of $f$-divergences, yielding a family of contrastive objectives that can match or outperform the KL-based objective depending on the task and dataset. Assuming the joint feature density is proportional to a Gaussian kernel, it derives the $f$-Gaussian similarity $s_f = f'\circ G_\sigma(\|x^g - y^g\|^2)$, which is the optimal similarity under that assumption and an alternative to the heuristic cosine similarity. The work also relates $f$-MICL to existing objectives and methods such as InfoNCE, AU, SimCLR, and MoCo, and reports vision and NLP experiments in which the best-performing $f$-divergence is task and dataset dependent and the $f$-Gaussian similarity gives better empirical performance than cosine similarity. Overall, $f$-MICL offers a flexible, theoretically grounded framework for self-supervised representation learning that generalizes both the objective and the similarity measure.
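
A short worked case may help relate the $f$-Gaussian similarity to cosine similarity. The sketch below assumes the usual Gaussian kernel form $G_\sigma(d^2) = \mu\exp(-d^2/(2\sigma^2))$ with the constant $\mu$ set to $1$, unit-norm features, and the KL generator $f(u) = u\log u$; these specifics are illustrative assumptions, not details stated in the summary above.

```latex
\begin{align*}
f(u) &= u \log u, \qquad f'(u) = \log u + 1, \\
\|x^g - y^g\|^2 &= 2 - 2\,\langle x^g, y^g\rangle
  \quad \text{when } \|x^g\| = \|y^g\| = 1, \\
s_f(x^g, y^g) &= f'\!\big(G_\sigma(\|x^g - y^g\|^2)\big)
  = 1 - \frac{\|x^g - y^g\|^2}{2\sigma^2}
  = \frac{\langle x^g, y^g\rangle}{\sigma^2} + 1 - \frac{1}{\sigma^2}.
\end{align*}
```

Under these assumptions the KL case is an affine function of cosine similarity, with $\sigma^2$ playing the role of a temperature; other choices of $f$ generally give a nonlinear function of the squared distance.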

Abstract

In self-supervised contrastive learning, a widely-adopted objective function is InfoNCE, which uses the heuristic cosine similarity for the representation comparison, and is closely related to maximizing the Kullback-Leibler (KL)-based mutual information. In this paper, we aim at answering two intriguing questions: (1) Can we go beyond the KL-based objective? (2) Besides the popular cosine similarity, can we design a better similarity function? We provide answers to both questions by generalizing the KL-based mutual information to the $f$-Mutual Information in Contrastive Learning ($f$-MICL) using the $f$-divergences. To answer the first question, we provide a wide range of $f$-MICL objectives which share the nice properties of InfoNCE (e.g., alignment and uniformity), and meanwhile result in similar or even superior performance. For the second question, assuming that the joint feature distribution is proportional to the Gaussian kernel, we derive an $f$-Gaussian similarity with better interpretability and empirical performance. Finally, we identify close relationships between the $f$-MICL objective and several popular InfoNCE-based objectives. Using benchmark tasks from both vision and natural language, we empirically evaluate $f$-MICL with different $f$-divergences on various architectures (SimCLR, MoCo, and MoCo v3) and datasets. We observe that $f$-MICL generally outperforms the benchmarks and the best-performing $f$-divergence is task and dataset dependent.
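
The following NumPy sketch illustrates how a batch estimate of an $f$-MICL-style objective might be computed with a Gaussian-kernel similarity. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the kernel constant (set to 1), the choice of negatives (all pairs $x_i, x_j$ with $i \neq j$), and the two generators shown are assumptions made here. It uses the Fenchel-Young identity $f^*(f'(u)) = u f'(u) - f(u)$ so that no conjugate needs to be written by hand.

```python
import numpy as np

# Hypothetical registry of generators: name -> (f, f').
GENERATORS = {
    "kl": (lambda u: u * np.log(u), lambda u: np.log(u) + 1.0),
    "pearson_chi2": (lambda u: (u - 1.0) ** 2, lambda u: 2.0 * (u - 1.0)),
}


def gaussian_kernel(sq_dist, sigma=1.0):
    """Gaussian kernel of a squared distance (normalizing constant assumed to be 1)."""
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def f_micl_objective(x, y, name="kl", sigma=1.0):
    """Batch estimate of E_pos[f'(G)] - E_neg[f*(f'(G))], to be maximized.

    x, y: arrays of shape (batch, dim); row i of x and row i of y form a positive pair.
    """
    f, f_prime = GENERATORS[name]
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Project features onto the unit hypersphere.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)

    # Positive pairs: two augmentations of the same sample.
    pos_sq = np.sum((x - y) ** 2, axis=1)

    # Negative pairs: different samples x_i, x_j with i != j.
    diff = x[:, None, :] - x[None, :, :]
    neg_sq = np.sum(diff ** 2, axis=2)
    off_diagonal = ~np.eye(len(x), dtype=bool)
    neg_sq = neg_sq[off_diagonal]

    u_pos = gaussian_kernel(pos_sq, sigma)
    u_neg = gaussian_kernel(neg_sq, sigma)

    alignment = np.mean(f_prime(u_pos))
    # Fenchel-Young equality: f*(f'(u)) = u * f'(u) - f(u).
    conjugate_term = np.mean(u_neg * f_prime(u_neg) - f(u_neg))
    return alignment - conjugate_term
```

For example, `f_micl_objective(np.eye(3), np.eye(3), "kl")` evaluates the objective on three perfectly aligned, mutually orthogonal features; swapping `"kl"` for `"pearson_chi2"` changes only the generator, which is the sense in which the objective forms a family.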

Paper Structure

This paper contains 28 sections, 6 theorems, 39 equations, 7 figures, 10 tables, 1 algorithm.

Key Result

Lemma 1

Suppose $f$ is differentiable, and the embedding function $g$ is fixed. The following similarity function $s_\star$ maximizes the objective in \ref{eq:objective}:
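
For orientation, the classical variational representation of $f$-divergences on which results of this kind rest can be sketched as follows. The symbols $P$, $Q$, and $T$ are generic placeholders introduced here (for mutual information, $P$ is the joint distribution and $Q$ the product of marginals); this is the standard textbook form, not the paper's exact statement in terms of $s_\star$ and $g$.

```latex
\begin{align*}
D_f(P \,\|\, Q) &= \sup_{T}\; \mathbb{E}_{P}[T] - \mathbb{E}_{Q}[f^*(T)], \\
f^*(t) &= \sup_{u}\,\{\, u t - f(u) \,\}, \\
T_\star &= f'\!\left(\frac{dP}{dQ}\right) \quad \text{attains the supremum when } f \text{ is differentiable.}
\end{align*}
```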

Figures (7)

  • Figure 1: Experiment for verifying Assumption \ref{assmp:joint_vMF}. Here we plot the relation between the squared distances $\|x^g - y^g\|^2$ and the averaged log-likelihood $\log p_g$, with $\log p_g$ estimated by the flow model RealNVP (dinh2016density). (left) Gaussian prior; (right) Uniform prior. The features are learned by SimCLR trained on CIFAR-10. See more details in Appendix \ref{sec:add_exp}.
  • Figure 2: Network architecture of $f$-MICL. $\mathtt{image}_i$: the $i^{\rm th}$ image in the current batch; $f$: the function used in the $f$-mutual information (§\ref{sec:prem}); $g$: feature embedding; $t$, $t_1$, $t_2$: augmentation functions drawn from the same family $\mathcal{T}$ of augmentations; $f'$: the derivative of $f$; $f^*$: the Fenchel conjugate of $f$. The symbol $\circ$ denotes function composition. The sum of the two terms gives the variational lower bound of the $f$-mutual information. $x_i$ and $y_i$ are two augmentations of the $i$-th sample, and $x_i$ and $x_j$ are different samples with independently sampled data augmentations. $\max$ stands for maximization. See \ref{eq:objective_sample} for more details.
  • Figure 3: $f$-MICL generalizes InfoNCE-based objectives.
  • Figure 4: (left and middle) Distances between pairs of normalized features within a batch. Green region: similar pairs. Orange region: dissimilar pairs. $f$-MICL gives nearly uniform distances for dissimilar pairs for the $f$-divergences in Table \ref{tbl:choices_f_div}. For non-satisfying $f$-divergences such as the reverse KL (RKL), the features collapse to a constant and thus the distances are zero. (right) Test accuracy vs. batch size after training for $200$ epochs, for all algorithms.
  • Figure 5: A regular simplex on a hypersphere.
  • ...and 2 more figures
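
The pairwise-distance diagnostic described for Figure 4 and the regular simplex of Figure 5 can be illustrated with a small NumPy sketch. The helper names `pairwise_distances` and `regular_simplex` are hypothetical and introduced only for this illustration; the simplex construction (centering the standard basis and renormalizing) is one standard way to place $n$ equidistant points on a unit hypersphere and is not taken from the paper.

```python
import numpy as np


def pairwise_distances(features):
    """Euclidean distances between all distinct pairs of L2-normalized rows."""
    z = np.asarray(features, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    # For unit vectors, ||a - b||^2 = 2 - 2 <a, b>; clip guards against tiny negatives.
    sq = np.clip(2.0 - 2.0 * (z @ z.T), 0.0, None)
    upper = np.triu_indices(len(z), k=1)
    return np.sqrt(sq[upper])


def regular_simplex(n):
    """n unit vectors in R^n whose pairwise inner products all equal -1/(n-1)."""
    v = np.eye(n) - 1.0 / n
    return v / np.linalg.norm(v, axis=1, keepdims=True)


# Perfectly uniform features: every pair is at the same distance sqrt(2n/(n-1)).
simplex_distances = pairwise_distances(regular_simplex(5))

# Collapsed features: every row is the same vector, so all distances are (near) zero.
collapsed_distances = pairwise_distances(np.ones((4, 3)))
```

A histogram of `pairwise_distances` over a batch of learned features would show a narrow peak in the uniform case and a spike at zero under collapse.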

Theorems & Definitions (10)

  • Definition 1: $f$-mutual information, csiszar1967information
  • Lemma 1: nguyen2010estimating
  • Theorem 3: Uniformity
  • Proposition 3: weighting parameter
  • proof
  • Lemma 3: nguyen2010estimating
  • proof
  • Theorem 3: Uniformity
  • proof
  • Theorem 4: estimation error