One Leaf Reveals the Season: Occlusion-Based Contrastive Learning with Semantic-Aware Views for Efficient Visual Representation

Xiaoyu Yang, Lijian Xu, Hongsheng Li, Shaoting Zhang

TL;DR

Occluded Image Contrastive Learning (OCL) presents a simple, scalable pre-training paradigm that pairs masked-image tokens with contrastive learning to emphasize high-level semantic concepts. By generating two non-overlapping views within an image through high-rate masking and applying a symmetric contrastive objective with a T-distributed spherical similarity, OCL yields rich semantic representations without reconstruction or auxiliary modules. The method demonstrates strong scalability on Vision Transformers, achieving 85.8% fine-tuning accuracy on ImageNet with ViT-L/16 after 133 hours on 4 A100 GPUs, and outperforms several prior pre-training approaches while reducing training complexity. This work suggests a practical path toward efficient, high-quality visual representations suitable for large-scale models and downstream tasks.

Abstract

This paper proposes occluded image contrastive learning (OCL), a scalable and straightforward pre-training paradigm for efficient visual conceptual representation. The approach is simple: we randomly mask patches to generate different views within an image and contrast them among a mini-batch of images. OCL rests on two design choices. First, masking tokens can significantly reduce the conceptual redundancy inherent in images and creates distinct views whose fine-grained differences lie at the semantic concept level rather than the instance level. Second, contrastive learning is adept at extracting high-level semantic conceptual features during pre-training, avoiding the high-frequency interference and additional costs associated with image reconstruction. Importantly, OCL learns highly semantic conceptual representations efficiently without relying on hand-crafted data augmentations or additional auxiliary modules. Empirically, OCL scales well with Vision Transformers: ViT-L/16 completes pre-training in 133 hours using only 4 A100 GPUs and achieves 85.8% accuracy in downstream fine-tuning. Code is available at https://anonymous.4open.science/r/OLRS/.
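The contrast "among a mini-batch of images" described above can be illustrated with a minimal sketch of a symmetric contrastive objective. This is not the authors' implementation: the function names, the temperature value, and the use of plain temperature-scaled cosine similarity are assumptions made here for illustration. The paper's T-distributed spherical similarity is not specified in this section, so a simpler similarity stands in for it.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(view_a, view_b, temperature=0.2):
    """Symmetric InfoNCE-style loss over a mini-batch.

    view_a[i] and view_b[i] are embeddings of two views of image i.
    Views of the same image are positives; all other pairs in the batch
    are negatives. Assumption: plain cosine similarity is used here in
    place of the paper's T-distributed spherical similarity.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    logits_ab = [
        [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        for a_i in a
    ]
    logits_ba = [list(column) for column in zip(*logits_ab)]  # transpose
    return 0.5 * (cross_entropy_diagonal(logits_ab) + cross_entropy_diagonal(logits_ba))
```

With matching views the loss is small, and it grows when the views of different images are swapped, which is the behaviour the objective is meant to reward.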

Paper Structure

This paper contains 23 sections, 3 equations, 3 figures, 11 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison between different pre-training paradigms. The model in blue is the pre-training model, and the orange modules indicate auxiliary modules. (a) Contrastive Learning (CL) seeks to maximize the agreement between different views of an image. (b) Masked Image Modeling (MIM) aims to restore masked image patches. (c) Our occluded image contrastive learning: through non-overlapping occlusion, distinct tokens within an image are treated as intraclass, while tokens from different images within a batch are treated as interclass. Our objective is to enhance intraclass compactness and interclass separability through contrastive learning. Just as a single leaf can tell the coming of autumn, we believe that a small area of the image contains the majority of the meaning of the entire image.
  • Figure 2: A toy example of masked images for conceptual contrastive learning. The low global masking ratio aids the model in capturing comprehensive information from the image and understanding the interconnectedness of various concepts within a mini-batch. Beyond that, each contrastive branch has a higher masking ratio, generating diverse views with different semantic concepts for contrastive learning and diminishing conceptual redundancy within the image.
  • Figure 3: Efficiency and Scaling. MAE (He et al., 2022), I-JEPA (Assran et al., 2023) and MoCo v3 (Chen et al., 2021) are selected for comparison. All methods are evaluated by linear probing, with Top-1 accuracy (Acc) as the metric and pre-training cost measured in A100 GPU hours. The pre-training epochs (denoted as ep) and model architecture are also shown.
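The masking scheme described in the captions of Figures 1 and 2 (a low global masking ratio, with each contrastive branch seeing a smaller, non-overlapping subset of patches) might be sketched as follows. This is a hypothetical illustration rather than the paper's algorithm: the function name, the equal split between the two branches, and the example ratio are all assumptions.

```python
import random


def split_non_overlapping_views(num_patches, global_mask_ratio, rng=None):
    """Pick visible patches, then split them into two disjoint views.

    Assumption: the patches left visible after global masking are divided
    equally between two branches, so each branch has a higher masking
    ratio than the global one and the two views share no patch.
    """
    rng = rng or random.Random()
    indices = list(range(num_patches))
    rng.shuffle(indices)

    num_visible = int(num_patches * (1.0 - global_mask_ratio))
    num_visible -= num_visible % 2  # keep an even count so both views match in size
    visible = indices[:num_visible]

    half = num_visible // 2
    view_a = sorted(visible[:half])
    view_b = sorted(visible[half:])
    return view_a, view_b


# Example: a 14 x 14 patch grid (196 patches) with a global masking ratio of 0.5.
view_a, view_b = split_non_overlapping_views(196, 0.5, random.Random(0))
per_branch_mask_ratio = 1.0 - len(view_a) / 196
```

In this example half of the patches stay visible overall, while each branch sees only a quarter of the image, so the per-branch masking ratio is 0.75.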