Adaptive Multi-head Contrastive Learning

Lei Wang, Piotr Koniusz, Tom Gedeon, Liang Zheng

TL;DR

The pre-training loss function is derived as a solution to maximum likelihood estimation over head-wise posterior distributions of positive samples given observations. It incorporates the similarity measure over positive and negative pairs, each re-weighted by an individual adaptive temperature that is regulated to prevent ill-behaved solutions.

Abstract

In contrastive learning, two views of an original image, generated by different augmentations, are considered a positive pair, and their similarity is required to be high. Similarly, two views of distinct images form a negative pair, whose similarity is encouraged to be low. Typically, a single similarity measure, provided by a single projection head, evaluates both positive and negative sample pairs. However, due to diverse augmentation strategies and varying intra-sample similarity, views from the same image may not always be similar. Additionally, owing to inter-sample similarity, views from different images may be more alike than those from the same image. Consequently, enforcing high similarity for positive pairs and low similarity for negative pairs may be unattainable, and in some cases such enforcement could harm performance. To address this challenge, we propose using multiple projection heads, each producing a distinct set of features. Our pre-training loss function is derived as a solution to maximum likelihood estimation over head-wise posterior distributions of positive samples given observations. This loss incorporates the similarity measure over positive and negative pairs, each re-weighted by an individual adaptive temperature that is regulated to prevent ill-behaved solutions. Our approach, Adaptive Multi-Head Contrastive Learning (AMCL), can be applied to, and experimentally improves, several popular contrastive learning methods such as SimCLR, MoCo, and Barlow Twins. The improvement remains consistent across various backbones and linear probing epochs, and becomes more significant when multiple augmentation methods are employed.
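
The idea of several projection heads, each with its own pair-wise temperature, can be illustrated with a small NumPy sketch. This is not the AMCL loss from the paper: it is a plain InfoNCE-style loss averaged over heads, the temperatures are passed in as an array rather than learned by a shared MLP, and the temperature regularizer is omitted. The names `l2_normalize` and `multi_head_info_nce` and the array shapes are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def multi_head_info_nce(z1, z2, temps):
    """Illustrative multi-head InfoNCE-style loss (hypothetical helper, not the paper's loss).

    z1, z2: arrays of shape (C, N, D), features of two views from C projection heads.
    temps:  positive temperatures broadcastable to (C, N, N), one per head and pair.
    Row i of head c treats (z1[c, i], z2[c, i]) as the positive pair and
    (z1[c, i], z2[c, j]) for j != i as negative pairs.
    """
    z1 = l2_normalize(z1)
    z2 = l2_normalize(z2)
    # Cosine similarity between every view-1 / view-2 pair, per head: (C, N, N).
    sim = np.einsum("cnd,cmd->cnm", z1, z2)
    # Re-weight each pair's similarity by its own temperature.
    logits = sim / temps
    # Numerically stable log-softmax over the candidates of each anchor.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Positive pairs sit on the diagonal of each head's similarity matrix.
    idx = np.arange(z1.shape[1])
    pos_log_prob = log_prob[:, idx, idx]  # shape (C, N)
    # Average the negative log-likelihood over heads and anchors.
    return float(-pos_log_prob.mean())

# Usage with random features: 3 heads, 8 images, 128-dim projections.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(3, 8, 128))
z2 = z1 + 0.1 * rng.normal(size=(3, 8, 128))   # second view is a perturbed first view
temps = np.full((3, 8, 8), 0.5)                # constant here; adaptive in the paper
loss = multi_head_info_nce(z1, z2, temps)
```

In this sketch a constant `temps` array recovers the usual fixed-temperature setting applied independently to each head; supplying a different value per head and pair is where an adaptive scheme would plug in.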

Paper Structure

This paper contains 16 sections, 6 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: In (a), each column denotes positive (green dots) or negative instances (red dots) with corresponding similarity measures. Additional augmentations can cause positive samples to appear dissimilar and occasionally make negative samples seem similar. The table in (a) shows the original similarity measure (in gray) and the similarity scores from our method (in black). (b)-(d): for traditional contrastive learning methods, when increasing the number of augmentations from 1 to 5, similarities of more positive pairs drop below 0.5, causing more significant overlapping regions between histograms of positive (orange) and negative (blue) pairs. In comparison, our multi-head approach (e)-(g) yields better separation of positive and negative sample pairs as more augmentation types are used, e.g., (g) vs. (d).
  • Figure 2: A comparison of (a) the standard constant-temperature, single-head approach and (b)–(c) our adaptive-temperature, single- and multi-head approaches. In each subfigure, the first light blue trapezoid represents the base encoder, the second light blue trapezoid signifies the MLP projection head, and the third light orange trapezoid denotes the shared MLP layer for learning the temperature parameters. In (c), the projection head is replicated $C$ times to capture diverse image content. For better visualization, we set $C\!=\!3$. The architecture of the MLP projection head remains unchanged; however, the weights of each head are learned independently. For a given image pair, each projection head produces a pair of feature vectors, which are later used for learning the pair-adaptive, head-wise temperature. The learned temperatures, along with the projected features, are incorporated into our Adaptive Multi-head Contrastive Learning (AMCL) loss function (as presented in Table \ref{tab:loss_functions2}).
  • Figure 3: Distribution of similarity scores for positive and negative pairs. The baseline uses one projection head and a constant temperature, while our method has multiple projection heads and adaptive temperatures. We use SimCLR for pre-training with ResNet-18 on STL-10. After pre-training, we choose 500 positive pairs and 500 negative pairs from the validation set to compute the cosine similarity. In (a) and (b), the similarity score (temperature scaled) is computed between the 128-dim features extracted from the projection head(s). In (c) and (d), the cosine similarity score is computed between the 512-dim features extracted from the backbone after removing the projection heads.
  • Figure 4: Impact of different backbones and training epochs on AMCL for (left:) SimCLR and (right:) CAN on the ImageNet dataset. All the reported accuracies use a linear probe.
  • Figure 5: Hyperparameter sensitivity analysis. (left:) Number of projection heads. (right:) Evaluation of top-$\kappa$ similarities among negative pairs on STL-10. "Softmax" includes all negative pair similarities (Eq. (\ref{eq:jump}) in Appendix \ref{sec:deriv}). We use ResNet-18 as the backbone with SimCLR. The dashed line is the baseline result with 1 projection head and a constant temperature.
  • ...and 4 more figures
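
The kind of analysis described in the captions of Figures 1 and 3 (cosine similarities of positive versus negative pairs, and how many positive pairs fall below 0.5) could be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: `pair_similarities` and `fraction_below` are hypothetical helpers, and random arrays stand in for the features of two augmented views.

```python
import numpy as np

def pair_similarities(f1, f2, eps=1e-12):
    """Split cosine similarities into positive and negative pairs.

    f1, f2: arrays of shape (N, D); row i of each holds features of one view
    of image i. Positive pairs are (f1[i], f2[i]); negative pairs are
    (f1[i], f2[j]) with i != j.
    """
    f1 = f1 / np.maximum(np.linalg.norm(f1, axis=1, keepdims=True), eps)
    f2 = f2 / np.maximum(np.linalg.norm(f2, axis=1, keepdims=True), eps)
    sim = f1 @ f2.T                               # (N, N) cosine similarities
    on_diagonal = np.eye(sim.shape[0], dtype=bool)
    return sim[on_diagonal], sim[~on_diagonal]

def fraction_below(values, threshold=0.5):
    """Fraction of similarity scores that fall below the threshold."""
    return float(np.mean(values < threshold))

# Usage with stand-in features (assumption: 512-dim vectors, 100 images).
rng = np.random.default_rng(0)
view1 = rng.normal(size=(100, 512))
view2 = view1 + 0.5 * rng.normal(size=(100, 512))
pos, neg = pair_similarities(view1, view2)
low_positive_rate = fraction_below(pos)   # share of positive pairs below 0.5
pos_hist, edges = np.histogram(pos, bins=20, range=(-1.0, 1.0))
neg_hist, _ = np.histogram(neg, bins=20, range=(-1.0, 1.0))
```

Comparing `pos_hist` and `neg_hist` over the same bins gives a rough view of how much the two distributions overlap.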