What Should Not Be Contrastive in Contrastive Learning

Tete Xiao, Xiaolong Wang, Alexei A. Efros, Trevor Darrell

TL;DR

The paper tackles the problem that fixed augmentation invariances in contrastive self-supervised learning can impair downstream performance. It proposes Leave-one-out Contrastive Learning (LooC), a multi-embedding approach where each subspace is sensitive to a single augmentation while invariant to others, allowing task-driven combination of factors of variation. Across ImageNet-100 and diverse datasets, LooC and its concatenated variant LooC++ show superior transferability, few-shot performance, and robustness to corruptions compared to MoCo and various ablations. The work highlights the importance of modeling augmentation-dependent information rather than enforcing global invariances, with practical benefits for broad vision tasks.

Abstract

Recent self-supervised contrastive methods have been able to produce impressive transferable visual representations by learning to be invariant to different data augmentations. However, these methods implicitly assume a particular set of representational invariances (e.g., invariance to color), and can perform poorly when a downstream task violates this assumption (e.g., distinguishing red vs. yellow cars). We introduce a contrastive learning framework which does not require prior knowledge of specific, task-dependent invariances. Our model learns to capture varying and invariant factors for visual representations by constructing separate embedding spaces, each of which is invariant to all but one augmentation. We use a multi-head network with a shared backbone which captures information across each augmentation and alone outperforms all baselines on downstream tasks. We further find that the concatenation of the invariant and varying spaces performs best across all tasks we investigate, including coarse-grained, fine-grained, and few-shot downstream classification tasks, and various data corruptions.
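The leave-one-out construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes two augmentations (rotation and color, as in the paper's framework figure), hypothetical parameter ranges, and one reading of the pairing rule: key view k_i shares the i-th augmentation parameter with the query while all other parameters are re-sampled, so that in the i-th embedding space only k_i counts as a positive. The function names are invented for this example.

```python
import random

# Illustrative augmentation names; the parameter ranges below are assumptions.
AUGMENTATIONS = ["rotation", "color"]


def sample_params(rng):
    """Sample one hypothetical parameter per augmentation."""
    return {
        "rotation": rng.choice([0, 90, 180, 270]),
        "color": round(rng.uniform(-0.4, 0.4), 3),
    }


def leave_one_out_views(rng):
    """Return augmentation parameters for the query and n + 1 key views.

    keys[0] re-samples every augmentation (used by the all-invariant space).
    keys[i] copies the query's parameter for the i-th augmentation and
    re-samples the rest, so it differs from the query in all but one factor.
    """
    query = sample_params(rng)
    keys = [sample_params(rng)]
    for name in AUGMENTATIONS:
        key = sample_params(rng)
        key[name] = query[name]  # the one augmentation left unchanged
        keys.append(key)
    return query, keys


def positive_keys(space):
    """Indices of keys treated as positives for the query in a given space.

    space 0 is assumed invariant to all augmentations, so every key is a
    positive. space i > 0 is assumed sensitive to augmentation i, so only
    the key sharing that augmentation with the query is a positive.
    """
    if space == 0:
        return list(range(len(AUGMENTATIONS) + 1))
    return [space]
```

Under these assumptions, a contrastive loss would be applied once per embedding space, with the positive set chosen by `positive_keys` and the remaining views of the same image acting as negatives in the augmentation-sensitive spaces.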

Paper Structure

This paper contains 26 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Self-supervised contrastive learning relies on data augmentations, as depicted in (a), to learn visual representations. However, current methods introduce an inductive bias by encouraging neural networks to be less sensitive to the information changed by each augmentation, which may help or may hurt. As illustrated in (b), rotation-invariant embeddings can help on certain flower categories but may hurt animal recognition performance; conversely, color invariance generally seems to help coarse-grained animal classification but can hurt many flower and bird categories. Our method, shown in the following figure, overcomes this limitation.
  • Figure 2: Framework of the Leave-one-out Contrastive Learning approach, illustrated with two types of augmentations, i.e., random rotation and color jittering. We generate multiple views with a leave-one-out strategy, then project their representations into separate embedding spaces with a contrastive objective, where each embedding space is either invariant to all augmentations or invariant to all but one augmentation. The learned representation can be the general embedding space $\mathcal{V}$ (blue region) or the concatenation of embedding sub-spaces $\mathcal{Z}$ (grey region). Our results show that either of our proposed representations is able to outperform baseline contrastive embeddings and does not suffer from decreased performance when adding augmentations to which the task is not invariant (i.e., the red X's in Figure 1).
  • Figure 3: Top nearest-neighbor retrieval results of LooC vs. corresponding invariant MoCo baseline with color (left) and rotation (right) augmentations on IN-100 and iNat-1k. The results show that our model can better preserve information dependent on color and rotation despite being trained with those augmentations.
  • Figure 4: Histograms of correct predictions (activations${\times}$weights of classifier) by each augmentation-dependent head from IN-100 and iNat-1k. The classifier on IN-100 relies heavily on texture-dependent information, whereas it is much more balanced on iNat-1k. This is consistent with the gains observed when learning with multiple augmentations.
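
The "activations × weights" quantity in the Figure 4 caption can be illustrated with a small sketch. It assumes a linear classifier over the concatenation of per-head embeddings, in which case the logit of a class splits into one sum per head. How the authors actually bin these values into histograms is not stated here, so the function name and the example numbers below are hypothetical.

```python
def head_contributions(head_embeddings, class_weights):
    """Split one class logit into per-head sums of activation * weight.

    head_embeddings: list of per-head activation vectors (lists of floats),
        concatenated in order to form the classifier input.
    class_weights: classifier weight vector for a single class, with length
        equal to the total size of the concatenated embedding.
    """
    contributions = []
    offset = 0
    for embedding in head_embeddings:
        weights = class_weights[offset:offset + len(embedding)]
        contributions.append(sum(a * w for a, w in zip(embedding, weights)))
        offset += len(embedding)
    return contributions


# Toy example with three heads of two dimensions each (made-up numbers).
heads = [[1.0, 2.0], [0.5, 0.5], [3.0, 0.0]]
weights = [0.5, 0.25, 1.0, -1.0, 2.0, 4.0]
per_head = head_contributions(heads, weights)
logit = sum(per_head)  # equals the full dot product (bias term omitted)
```

In this toy case the third head dominates the logit, which is the kind of imbalance the caption reports for the texture-dependent head on IN-100.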