Variational Supervised Contrastive Learning

Ziwen Wang, Jiajun Fan, Thao Nguyen, Heng Ji, Ge Liu

TL;DR

This work reframes supervised contrastive learning as variational inference over latent class variables, yielding a posterior-weighted ELBO that uses class centroids and a confidence-aware temperature to regulate intra-class dispersion. The VarCon objective couples a distribution-alignment term (KL divergence) with a classification log-likelihood term. Because it replaces exhaustive pairwise comparisons with class-aware matching, training cost scales near-linearly with batch size, and the embeddings gain clearer semantic structure. Empirical results across CIFAR and ImageNet benchmarks show state-of-the-art Top-1 accuracy, faster convergence, stronger few-shot and transfer performance, and robustness to augmentation strategies and corruption. The approach bridges discriminative and generative perspectives by giving contrastive learning explicit probabilistic semantics and uncertainty modeling.

Abstract

Contrastive learning has proven to be highly efficient and adaptable in shaping representation spaces across diverse modalities by pulling similar samples together and pushing dissimilar ones apart. However, two key limitations persist: (1) without explicit regulation of the embedding distribution, semantically related instances can inadvertently be pushed apart unless complementary signals guide pair selection, and (2) excessive reliance on large in-batch negatives and tailored augmentations hinders generalization. To address these limitations, we propose Variational Supervised Contrastive Learning (VarCon), which reformulates supervised contrastive learning as variational inference over latent class variables and maximizes a posterior-weighted evidence lower bound (ELBO). This objective replaces exhaustive pairwise comparisons with efficient class-aware matching and grants fine-grained control over intra-class dispersion in the embedding space. In experiments using only image data on CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1K, VarCon (1) achieves state-of-the-art performance among contrastive learning frameworks, reaching 79.36% Top-1 accuracy on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while converging in just 200 epochs; (2) yields substantially clearer decision boundaries and semantic organization in the embedding space, as evidenced by KNN classification, hierarchical clustering results, and transfer-learning assessments; and (3) outperforms the supervised baseline in few-shot learning and is more robust across various augmentation strategies. Our code is available at https://github.com/ziwenwang28/VarContrast.

Paper Structure

This paper contains 46 sections, 50 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: VarCon architectural flowchart and pseudocode. Left: Input images are processed through an encoder network to produce $\ell_2$-normalized embeddings $\bm{z}$. Class-level centroids $\bm{w}_r$ are computed dynamically from mini-batch embeddings. The model estimates each sample's classification difficulty and applies confidence-adaptive temperature scaling $\tau_2(\bm{z})$, which tightens constraints on challenging samples and relaxes them for well-classified examples. Right: Pseudocode implementation of our ELBO-derived loss function combining KL divergence and negative log-likelihood terms.
  • Figure 2: (a) Top-1 accuracy on ImageNet versus temperature parameter; (b) Top-1 accuracy on ImageNet versus training epochs; (c) Top-1 accuracy on ImageNet versus batch size.
  • Figure 3: (a) KNN classifier accuracy on ImageNet embeddings; (b) Effect of adaptive temperature parameter $\epsilon$ on ImageNet Top-1 accuracy; (c) Robustness evaluation on ImageNet-C across different corruption severity levels.
  • Figure 4: Evolution of adaptive temperature $\tau_2$ during a full 50-epoch ImageNet training with ResNet-50 ($\epsilon = 0.02$, $\tau_1 = 0.1$, batch size 4096). (a) Mean $\tau_2$ increases from 0.09378 to 0.10656 over the complete training, indicating systematic confidence growth. (b) Density distributions (epochs 10-50) show rightward shift with initial broadening (std: 0.00961 $\rightarrow$ 0.01092) then stabilization, reflecting heterogeneous confidence development as the model distinguishes easy from hard samples.
  • Figure 5: Progressive evolution of the VarCon embedding space, visualized with t-SNE, during training on the ImageNet validation set. Our variational formulation demonstrates systematic improvement in semantic organization, with KNN-classifier accuracy increasing from 52.74% at epoch 50 to 79.11% at epoch 200 as clusters become increasingly well-separated and semantically coherent. The confidence-adaptive temperature mechanism enables fine-grained control over intra-class dispersion, resulting in embedding spaces with clear decision boundaries and hierarchical semantic structure that facilitate effective nearest-neighbor classification without additional parameterized classifiers.
  • ...and 4 more figures
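
The pipeline described in the Figure 1 caption (normalized embeddings, mini-batch centroids, a confidence-adaptive temperature, and a KL plus negative log-likelihood loss) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Two pieces are assumptions that the captions do not spell out: the linear map `adaptive_temperature`, chosen only so that $\tau_2$ stays within $[\tau_1 - \epsilon, \tau_1 + \epsilon]$ and grows with confidence, and the target distribution `q`, taken here to be a one-hot label softened by $\tau_2$. All function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_centroids(embeddings, labels):
    """Per-class centroid direction from the mini-batch.

    Normalizing the per-class sum gives the same direction as
    normalizing the per-class mean.
    """
    sums = {}
    for z, r in zip(embeddings, labels):
        acc = sums.setdefault(r, [0.0] * len(z))
        for i, x in enumerate(z):
            acc[i] += x
    return {r: l2_normalize(acc) for r, acc in sums.items()}

def adaptive_temperature(confidence, tau1=0.1, eps=0.02):
    """ASSUMPTION: linear map from confidence in [0, 1] to
    tau2 in [tau1 - eps, tau1 + eps]; low confidence gives a smaller
    (tighter) temperature, high confidence a larger (relaxed) one."""
    return (tau1 - eps) + 2.0 * eps * confidence

def varcon_loss_sketch(embeddings, labels, tau1=0.1, eps=0.02):
    """Mean over the batch of KL(q || p) + negative log-likelihood."""
    zs = [l2_normalize(z) for z in embeddings]
    centroids = class_centroids(zs, labels)
    classes = sorted(centroids)
    total = 0.0
    for z, r in zip(zs, labels):
        # p(r' | z): softmax over cosine similarity to each centroid.
        logits = [sum(a * b for a, b in zip(z, centroids[c])) / tau1
                  for c in classes]
        p = softmax(logits)
        confidence = p[classes.index(r)]
        tau2 = adaptive_temperature(confidence, tau1, eps)
        # ASSUMPTION: target q is the one-hot label softened by tau2.
        q = softmax([(1.0 if c == r else 0.0) / tau2 for c in classes])
        kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
        nll = -math.log(confidence)
        total += kl + nll
    return total / len(zs)

# Toy batch: two tight clusters, labeled consistently and then inconsistently.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_separated = varcon_loss_sketch(batch, [0, 0, 1, 1])
loss_mixed = varcon_loss_sketch(batch, [0, 1, 0, 1])
```

With the Figure 4 settings ($\tau_1 = 0.1$, $\epsilon = 0.02$), the assumed map confines $\tau_2$ to $[0.08, 0.12]$, which is consistent with the reported mean values of 0.09378 and 0.10656. Each sample is compared only against the class centroids, so the work per sample grows with the number of classes in the batch rather than with the number of other samples.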