Poly-View Contrastive Learning

Amitis Shidani, Devon Hjelm, Jason Ramapuram, Russ Webb, Eeshan Gunesh Dhekane, Dan Busbridge

TL;DR

Poly-view contrastive models trained for 128 epochs with batch size 256 outperform SimCLR trained for 1024 epochs at batch size 4096 on ImageNet1k, challenging the belief that contrastive models require large batch sizes and many training epochs.

Abstract

Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimited computation, one should maximize the number of related views, and with a fixed compute budget, it is beneficial to decrease the number of unique samples whilst increasing the number of views of those samples. In particular, poly-view contrastive models trained for 128 epochs with batch size 256 outperform SimCLR trained for 1024 epochs at batch size 4096 on ImageNet1k, challenging the belief that contrastive models require large batch sizes and many training epochs.

Paper Structure (77 sections, 18 theorems, 106 equations, 5 figures, 3 tables, 2 algorithms)

Key Result

Proposition 2.1

For $K$ independent samples and multiplicity $M$, denoted $\mathbf{X}_{1:K, 1:M}$, the Multi-Crop of any $\mathcal{L}_{\textrm{Pair}}$ in eq:pairwise-loss has the same MI lower bound as the corresponding $\mathcal{L}_{\textrm{Pair}}$, with the expectation in the bound taken over $K$ independent samples (see app-subsec:multicrop-mi for the proof).
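
To make the Multi-Crop construction concrete, here is a minimal sketch that averages a pairwise contrastive loss over every ordered pair of distinct views. It rests on assumptions not stated in this summary: InfoNCE stands in for $\mathcal{L}_{\textrm{Pair}}$, similarity is a dot product, and the names `info_nce`, `multi_crop_loss` and `temperature` are hypothetical.

```python
import math
from itertools import permutations


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchors, targets, temperature=0.1):
    """Pairwise InfoNCE (an assumed choice of L_Pair).

    Row i of `anchors` is the positive for row i of `targets`;
    all other rows of `targets` act as negatives.
    """
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def multi_crop_loss(views, temperature=0.1):
    """Average the pairwise loss over all ordered view pairs (alpha != beta).

    `views[alpha][k]` is the embedding of view alpha of sample k, so
    `views` has M entries, each holding K embeddings.
    """
    pairs = list(permutations(range(len(views)), 2))
    losses = [info_nce(views[a], views[b], temperature) for a, b in pairs]
    return sum(losses) / len(pairs)


# Toy input: K = 2 samples, M = 3 views each, 2-D embeddings.
views = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
]
print(multi_crop_loss(views))
```

Under these assumptions, adding views changes how many pairwise terms are averaged but not the form of each term. With uninformative embeddings (all logits equal), every term equals $\log K$ regardless of $M$.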

Figures (5)

  • Figure 1: (a) The role of multiplicity in contrastive learning. $\mathcal{I}(\mathbf{x}; \mathbf{y})$ denotes the MI between two random variables $\mathbf{x}$ and $\mathbf{y}$, while $\mathcal{I}(\mathbf{x}; \mathbf{Y})$ is the MI between $\mathbf{x}$ and the set of random variables $\mathbf{Y}$. $\mathcal{L}_{\textrm{Method}}$ denotes the contrastive lower bound achieved by each method, ignoring constants. In the multi-crop box, $\ell_\alpha(\mathbf{x}, \mathbf{y})$ is the contrastive lower bound produced by the $\alpha$-th crop/view. (b) Multi-view sample generation with generative factor $\mathbf{c}$, where the main sample is generated through the generative process $\rho$, and views are generated through different view-generation processes $\eta_\alpha$ for $\alpha \in [M]$, e.g. augmentations. The goal is to find the map $h^\star$ such that the reconstructed generative factor $\hat{\mathbf{c}}$ recovers $\mathbf{c}$, hence the identity map.
  • Figure 2: Comparing MI bounds with the true MI in the Gaussian setting. Each method is trained for $200$ epochs with multiplicities $M \in \{2, 4, 8, 10\}$. Left to right: 1) True One-vs-Rest MI (eq:gaussian-mi); 2) MI gaps decrease as $M$ grows for all methods except Multi-Crop, due to the $\log(K)$ factor; 3) Relative MI = True MI / Lower Bound MI; and 4) losses for each objective. Bands indicate the mean and standard deviation across $16$ runs. Points indicate final model performance of corresponding hyperparameters.
  • Figure 3: Contrastive ResNet 50 trained on ImageNet1k for different epochs or with different view multiplicities. Blue, red, orange and black dashed lines represent Geometric, Multi-Crop, Sufficient Statistics, and SimCLR respectively. Bands indicate the mean and standard deviation across three runs. Points indicate final model performance of corresponding hyperparameters. We use $K=4096$ for Growing Batch and $K=(2/M)\times 4096$ for Fixed Batch. (a) Each method is trained with a multiplicity $M=8$ except the $M=2$ SimCLR baseline. We compare models in terms of performance against training epochs (left), total updates (middle), which are affected by batch size $K$, and relative compute (right), which is defined in eq:relative-compute. See subsubsec:flops for a FLOPs comparison. (b) Each method is trained for 128 epochs for each multiplicity $M\in\{2,3,4,6,8,12,16\}$.
  • Figure 4: ResNet 50 trained for 128 epochs with different objectives for different strengths of color augmentation (a) and cropping strategy (b). Geometric and Arithmetic methods presented use multiplicity $M=4$.
  • Figure 5: Training at multiplicity $M=8$ varying training epochs.
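
The two batch regimes in the Figure 3 caption can be worked through numerically. The sketch below applies $K=4096$ (Growing Batch) and $K=(2/M)\times 4096$ (Fixed Batch) to the listed multiplicities. Flooring non-integer batch sizes is an assumption of this sketch, since the caption does not say how cases such as $M=3$ are handled, and the function names are hypothetical.

```python
BASE_BATCH = 4096  # K at M = 2, taken from the Figure 3 caption
MULTIPLICITIES = [2, 3, 4, 6, 8, 12, 16]  # taken from the Figure 3 caption


def fixed_batch_size(m, base=BASE_BATCH):
    """Unique samples per batch under K = (2 / M) * base.

    Integer flooring is an assumption made here for illustration.
    """
    return (2 * base) // m


def growing_batch_size(m, base=BASE_BATCH):
    """Unique samples per batch stay at `base` for every multiplicity."""
    return base


def total_views(k, m):
    """Number of views processed per batch: K samples times M views."""
    return k * m


for m in MULTIPLICITIES:
    k_fixed = fixed_batch_size(m)
    k_grow = growing_batch_size(m)
    print(
        f"M={m:2d}  fixed: K={k_fixed:4d}, views={total_views(k_fixed, m):5d}  "
        f"growing: K={k_grow}, views={total_views(k_grow, m):5d}"
    )
```

In the Fixed Batch regime the number of views per batch stays at roughly $2 \times 4096$, so raising $M$ trades unique samples for more views of each sample. In the Growing Batch regime the view count scales linearly with $M$.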

Theorems & Definitions (32)

  • Definition 2.1: View Multiplicity
  • Proposition 2.1
  • Proposition 2.2
  • Definition 2.2: One-vs-Rest
  • Theorem 2.1: Generalized $\mathcal{I}_{\textnormal{NWJ}}$
  • Definition 2.3: Gap
  • Theorem 2.2: Arithmetic and Geometric PVC lower bound One-vs-Rest MI
  • Theorem 2.3
  • Theorem 2.4: Sufficient Statistics lower bound One-vs-Rest MI
  • Proposition (thm:multicrop-mi)
  • ...and 22 more