Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Shentong Mo, Shengbang Tong

TL;DR

C-JEPA (Contrastive-JEPA) is a framework that integrates the Image-based Joint-Embedding Predictive Architecture (I-JEPA) with the Variance-Invariance-Covariance Regularization (VICReg) strategy. It is designed to learn the variance and covariance of the representations, which prevents entire collapse, and to ensure invariance in the mean of augmented views.

Abstract

In unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through a masking strategy. Despite its success, two primary limitations have been identified: the Exponential Moving Average (EMA) used in I-JEPA does not reliably prevent entire collapse, and the I-JEPA prediction does not accurately learn the mean of patch representations. To address these limitations, this study introduces C-JEPA (Contrastive-JEPA), which integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy. The integration is designed to learn the variance and covariance of the representations, which prevents entire collapse, and to ensure invariance in the mean of augmented views. Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA converges faster and reaches better performance than I-JEPA in both linear probing and fine-tuning.
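The abstract does not spell out how the VICReg terms are computed, so the following is only a minimal NumPy sketch of the three terms as they are commonly defined for VICReg, not the authors' implementation. The function names `vicreg_terms` and `vicreg_loss`, the use of plain `(N, D)` embedding batches, and the weights 25/25/1 (the defaults reported for VICReg) are assumptions made for illustration; how C-JEPA applies these terms to patch representations is not stated here.

```python
import numpy as np


def vicreg_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) for two (N, D) embedding batches.

    Hypothetical helper for illustration; shapes and defaults are assumptions.
    """
    n, d = z_a.shape

    # Invariance: mean squared error between the two views' embeddings
    # (averaged over both batch and feature dimensions).
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation: penalize any
        # dimension whose std falls below the target gamma.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Sum of squared off-diagonal covariance entries, scaled by D,
        # which pushes different dimensions to be decorrelated.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return invariance, variance, covariance


def vicreg_loss(z_a, z_b, w_inv=25.0, w_var=25.0, w_cov=1.0):
    """Weighted sum of the three terms; the weights are assumed defaults."""
    inv, var, cov = vicreg_terms(z_a, z_b)
    return w_inv * inv + w_var * var + w_cov * cov
```

In this sketch, a fully collapsed batch (every row identical) has zero invariance and covariance terms but a large variance term, which is the signal a regularizer of this kind relies on to discourage collapse.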

Paper Structure

This paper contains 23 sections, 22 equations, 8 figures, 15 tables, 1 algorithm.

Figures (8)

  • Figure 1: Our C-JEPA achieves faster and better convergence than I-JEPA.
  • Figure 2: Illustration of I-JEPA (a) and SimSiam (b).
  • Figure 3: Qualitative visualization of learned attention maps using a ViT-B/16 model. Columns for each sample denote the original image, attention maps from the target encoder in I-JEPA (Assran et al., 2023), attention maps from the target encoder in our C-JEPA, and attention maps from the context encoder in our C-JEPA. Our C-JEPA achieves much better attention maps.
  • Figure 4: Qualitative visualization of learned attention maps using a ViT-B/16 model. Columns for each sample denote the original image, attention maps from the target encoder in I-JEPA (Assran et al., 2023), attention maps from the target encoder in our C-JEPA, and attention maps from the context encoder in our C-JEPA. Our C-JEPA achieves much better attention maps.
  • Figure 5: Qualitative visualization of learned attention maps using a ViT-B/16 model. Columns for each sample denote the original image, attention maps from the target encoder in I-JEPA (Assran et al., 2023), attention maps from the target encoder in our C-JEPA, and attention maps from the context encoder in our C-JEPA. Our C-JEPA achieves much better attention maps.
  • ...and 3 more figures