LeOCLR: Leveraging Original Images for Contrastive Learning of Visual Representations

Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong

TL;DR

LeOCLR tackles semantic loss in contrastive self-supervised learning caused by random crops by introducing the original uncropped image as a semantic anchor. It forms three views per instance (X, X^1, X^2), encodes X with a query encoder and the crops with a momentum encoder, and trains to pull each crop toward X while maintaining discriminative power against negatives. The method uses a tailored loss that aligns X^1 and X^2 with X and leverages stop-gradient, leading to improved semantic feature learning. Across ImageNet-1K linear evaluation, transfer, and object-detection tasks, LeOCLR delivers consistent gains over SOTA contrastive methods, demonstrating enhanced robustness to augmentations and better transferability.
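The objective described above can be illustrated with a small InfoNCE-style sketch. It is not the paper's exact equation. It only shows the structure the summary describes: the embedding of the original image X acts as the query, each crop embedding acts as a positive key that is treated as a constant (the stop-gradient), and the two terms are summed. The function names, the temperature of 0.2, and the momentum coefficient of 0.999 are assumptions chosen for illustration, and embeddings are plain Python lists rather than encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive_key, negative_keys, temperature=0.2):
    """Negative log-softmax of the positive logit among positive + negatives."""
    logits = [cosine(query, positive_key) / temperature]
    logits += [cosine(query, key) / temperature for key in negative_keys]
    # Numerically stable log-sum-exp.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


def leoclr_style_loss(z_orig, z_crop1, z_crop2, negatives, temperature=0.2):
    """Pull both crop embeddings toward the original-image embedding.

    z_orig is assumed to come from the query encoder; z_crop1 and z_crop2
    are assumed to come from the momentum encoder and are treated as
    constants here, which plays the role of the stop-gradient.
    """
    return (info_nce(z_orig, z_crop1, negatives, temperature)
            + info_nce(z_orig, z_crop2, negatives, temperature))


def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average update commonly used for a momentum encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


if __name__ == "__main__":
    z_x = [1.0, 0.0]                       # original image X
    z_x1, z_x2 = [1.0, 0.1], [0.9, 0.0]    # crops that agree with X
    queue = [[0.0, 1.0], [-1.0, 0.0]]      # embeddings of other images
    print(leoclr_style_loss(z_x, z_x1, z_x2, queue))
```

In this sketch the loss shrinks as both crop embeddings move toward the embedding of X and away from the negatives, which matches the behaviour the summary attributes to the method.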

Abstract

Contrastive instance discrimination methods outperform supervised learning in downstream tasks such as image classification and object detection. However, these methods rely heavily on data augmentation during representation learning, which can lead to suboptimal results if not implemented carefully. A common augmentation technique in contrastive learning is random cropping followed by resizing. This can degrade the quality of representation learning when the two random crops contain distinct semantic content. To tackle this issue, we introduce LeOCLR (Leveraging Original Images for Contrastive Learning of Visual Representations), a framework that employs a novel instance discrimination approach and an adapted loss function. This method prevents the loss of important semantic features caused by mapping different object parts during representation learning. Our experiments demonstrate that LeOCLR consistently improves representation learning across various datasets, outperforming baseline models. For instance, LeOCLR surpasses MoCo-v2 by 5.1% on ImageNet-1K in linear evaluation and outperforms several other methods on transfer learning and object detection tasks.

Paper Structure (13 sections, 3 equations, 5 figures, 9 tables, 1 algorithm)

Figures (5)

  • Figure 1: Examples of positive pairs that might be created by random cropping and resizing.
  • Figure 2: The left panel shows the embedding space of established approaches [chen2020simple, chen2020improved], where the two views are attracted to each other regardless of their content. The right panel illustrates our approach, which clusters the two random views together with the original image in the embedding space.
  • Figure 3: LeOCLR: Overview of the proposed approach. The left part illustrates that the original image $X$ is not cropped (NC), but is resized to 224 $\times$ 224, before applying transformations. The other views ($X^1$ and $X^2$) are randomly cropped (RC1 and RC2) and resized to 224 $\times$ 224, followed by the application of transformations. The embedding space of our approach is depicted on the right side of the Figure.
  • Figure 4: Decrease in top-1 accuracy (in percentage points) of LeOCLR and our reproduction of vanilla MoCo-v2 after 200 epochs, under linear evaluation on ImageNet-1K. $R\_Grayscale$ refers to results without grayscale augmentations, while $R\_color$ refers to results without color jitter but with grayscale augmentations.
  • Figure 5: Semi-supervised training with a fraction of ImageNet-1K labels on a ResNet-50.