Enhancing Contrastive Learning Inspired by the Philosophy of "The Blind Men and the Elephant"

Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Yu Wang

TL;DR

The paper tackles the design of positive pairs in contrastive learning by introducing JointCrop and JointBlur, which impose a correlation between augmentation parameters to generate harder positive views. By formalizing a Unified JointAugmentation framework and controlling the joint sampling of crops and blur strengths (via JC(β)), the approach yields improved representations across multiple self-supervised methods and datasets, including ImageNet-1K and downstream tasks like VOC and COCO. The results show consistent gains with minimal overhead and demonstrate generalization to other augmentations, suggesting practical impact for boosting self-supervised vision models. The work also discusses limitations such as the energy cost of pretraining and outlines future directions to broaden augmentation families and integration with existing techniques.

Abstract

Contrastive learning is a prevalent technique in self-supervised vision representation learning, typically generating positive pairs by applying two data augmentations to the same image. Designing effective data augmentation strategies is crucial for the success of contrastive learning. Inspired by the story of the blind men and the elephant, we introduce JointCrop and JointBlur. These methods generate more challenging positive pairs by leveraging the joint distribution of the two augmentation parameters, thereby enabling contrastive learning to acquire more effective feature representations. To the best of our knowledge, this is the first effort to explicitly incorporate the joint distribution of two data augmentation parameters into contrastive learning. As a plug-and-play framework without additional computational overhead, JointCrop and JointBlur enhance the performance of SimCLR, BYOL, MoCo v1, MoCo v2, MoCo v3, SimSiam, and Dino baselines with notable improvements.
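
To make the idea of sampling two augmentation parameters jointly more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each view's crop is described by an area fraction in a range `[s_min, s_max]` (the values 0.2 and 1.0 are assumed here), and it uses a symmetric Beta(β, β) draw as a hypothetical stand-in for the paper's JC(β) distribution, whose actual density is not reproduced here. The only property borrowed from the source is qualitative: a smaller β should make area ratios far from 1 more likely. The function names `sample_joint_scales` and `mean_abs_log_ratio` are illustrative.

```python
import math
import random


def sample_joint_scales(beta, s_min=0.2, s_max=1.0, rng=random):
    """Sample two crop area fractions (s1, s2) whose ratio s2/s1 is drawn jointly.

    Hypothetical stand-in for JC(beta): u ~ Beta(beta, beta) is U-shaped for
    beta < 1, uniform for beta = 1, and peaked at 0.5 for beta > 1.
    """
    u = rng.betavariate(beta, beta)
    # Map u in [0, 1] to a log area ratio in [-log(s_max/s_min), +log(s_max/s_min)].
    max_log_ratio = math.log(s_max / s_min)
    ratio = math.exp((2.0 * u - 1.0) * max_log_ratio)

    # Choose s1 so that both s1 and s2 = ratio * s1 stay inside [s_min, s_max].
    lo = max(s_min, s_min / ratio)
    hi = min(s_max, s_max / ratio)
    s1 = rng.uniform(lo, hi)

    # Clamp to guard against floating-point drift at the range boundaries.
    s1 = min(max(s1, s_min), s_max)
    s2 = min(max(ratio * s1, s_min), s_max)
    return s1, s2


def mean_abs_log_ratio(beta, n=20000, seed=0):
    """Average |log(s2/s1)| over n samples: a rough measure of how far ratios sit from 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        s1, s2 = sample_joint_scales(beta, rng=rng)
        total += abs(math.log(s2 / s1))
    return total / n


if __name__ == "__main__":
    for beta in (0.5, 1.0, 4.0):
        print(beta, round(mean_abs_log_ratio(beta), 3))
```

In this sketch, independent sampling of the two crops is replaced by sampling their ratio first and then placing the pair inside the allowed range, so a single parameter controls how different the two views tend to be in size. The resulting area fractions could then be passed to whatever cropping routine a training pipeline already uses.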

Paper Structure

This paper contains 28 sections, 10 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: The motivation of our paper. We use the philosophy of the blind men and the elephant to analyze contrastive learning between positive sample pairs.
  • Figure 2: The statistical difficulty between the positive pairs generated by different fixed area ratios $s_r=s_2/s_1$.
  • Figure 3: The probability density map of JointCrop, which makes the area ratios of positive pairs follow a family of distributions JC$(\beta)$ parameterized by $\beta$. A smaller $\beta$ gives a higher probability that the ratio is far from 1.
  • Figure 4: The SDF$(\mathcal{T})$ between the positive pairs generated by J-Crop$(\beta)$ is measured using the already trained SimSiam encoder on the whole ImageNet-1K training dataset.
  • Figure 5: Training losses during SimSiam training on Tiny-ImageNet with samples generated by J-Crop$(\beta)$. We smooth the losses using a sliding window with a window size of 20. Our JointCrop creates positive pairs that are more challenging than those generated by RandomCrop.
  • ...and 3 more figures