CNN-JEPA: Self-Supervised Pretraining Convolutional Neural Networks Using Joint Embedding Predictive Architecture

András Kalapos, Bálint Gyires-Tóth

TL;DR

CNN-JEPA is a self-supervised learning method that adapts the joint embedding predictive architecture (JEPA) to Convolutional Neural Networks. It introduces a sparse CNN encoder, a fully convolutional predictor built from depthwise separable convolutions, and an improved masking strategy. The context and target encoders are tied via an exponential moving average (EMA) update, and the model learns by predicting latent representations of masked patches using an $L_2$ loss. With a ResNet-50 encoder, the approach reaches 73.3% linear top-1 accuracy on ImageNet-100 and reports competitive results on ImageNet-1k, while requiring 17-35% less training time than other CNN-based SSL methods for the same number of epochs and using no separate projector network. This work provides a simpler, more efficient SSL pathway for CNNs, with practical implications for large-scale pretraining and deployment.
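The two training ingredients named in the summary, the EMA update and the $L_2$ loss on masked patches, can be illustrated with a minimal pure-Python sketch. It assumes, as is usual for JEPA-style methods, that the target weights track the context weights. The function names, the flat list-of-floats weight layout, and the default momentum of 0.996 are hypothetical choices for illustration, not values taken from the paper.

```python
def ema_update(target, context, momentum=0.996):
    """Move each target weight toward the matching context weight.

    The momentum default is an illustrative assumption, not the paper's setting.
    """
    return [momentum * t + (1.0 - momentum) * c for t, c in zip(target, context)]


def masked_l2_loss(predicted, target, mask):
    """Mean squared L2 distance, computed over masked patches only.

    predicted, target: one embedding vector (list of floats) per patch.
    mask: one bool per patch, True where the patch was hidden from the context encoder.
    """
    distances = [
        sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
        for pred_vec, tgt_vec, hidden in zip(predicted, target, mask)
        if hidden
    ]
    return sum(distances) / len(distances) if distances else 0.0


# Toy example: 3 patches with 2-dimensional embeddings; patches 1 and 2 are masked.
predicted = [[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]]
target = [[9.0, 9.0], [1.0, 0.0], [3.0, 4.0]]
mask = [False, True, True]

loss = masked_l2_loss(predicted, target, mask)  # (4.0 + 1.0) / 2 = 2.5
new_target = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.5)  # [0.5, 3.0]
```

Patch 0 is visible, so its large prediction error does not contribute to the loss; only the hidden patches drive the gradient in this formulation.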

Abstract

Self-supervised learning (SSL) has become an important approach in pretraining large neural networks, enabling unprecedented scaling of model and dataset sizes. While recent advances like I-JEPA have shown promising results for Vision Transformers, adapting such methods to Convolutional Neural Networks (CNNs) presents unique challenges. In this paper, we introduce CNN-JEPA, a novel SSL method that successfully applies the joint embedding predictive architecture approach to CNNs. Our method incorporates a sparse CNN encoder to handle masked inputs, a fully convolutional predictor using depthwise separable convolutions, and an improved masking strategy. We demonstrate that CNN-JEPA outperforms I-JEPA with ViT architectures on ImageNet-100, achieving a 73.3% linear top-1 accuracy using a standard ResNet-50 encoder. Compared to other CNN-based SSL methods, CNN-JEPA requires 17-35% less training time for the same number of epochs and approaches the linear and k-NN top-1 accuracies of BYOL, SimCLR, and VICReg. Our approach offers a simpler, more efficient alternative to existing SSL methods for CNNs, requiring minimal augmentations and no separate projector network.
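To give a feel for why a predictor built from depthwise separable convolutions is lightweight, the following sketch compares weight counts for a standard convolution and its depthwise separable counterpart. The channel width of 2048 matches the final feature map of a standard ResNet-50; the 3 x 3 kernel size and the single-layer comparison are assumptions made for illustration, since the abstract does not specify the predictor's layer configuration.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv (bias omitted)."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


channels = 2048  # width of ResNet-50's final feature map
kernel = 3       # assumed kernel size, for illustration only

standard = conv_params(channels, channels, kernel)                  # 37,748,736
separable = depthwise_separable_params(channels, channels, kernel)  # 4,212,736
ratio = standard / separable                                        # just under 9x
```

Under these assumptions, the separable layer needs roughly nine times fewer weights than the standard one at the same input and output width.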

Paper Structure

This paper contains 15 sections, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Comparing CNN-JEPA (our method) to I-JEPA (Assran et al., 2023) and common SSL methods based on linear top-1 accuracy and training cost on ImageNet-100. The area of the markers is proportional to the number of parameters in the model.
  • Figure 2: Overview of the CNN-JEPA method. The context and target encoders share a common architecture, with the context encoder using sparse convolutions. The learning objective is to predict the latent representations of masked patches using a predictor, which also trains the context encoder.
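The role of sparse convolutions in the context encoder can be illustrated with a small 1D sketch. One common way to emulate a sparse convolution with dense operations is to zero the masked inputs, convolve, and then zero the masked output positions again, so that information from visible patches does not leak into masked locations. Whether CNN-JEPA implements its sparse encoder exactly this way is not stated in this text; the function names and the 1D setting are hypothetical simplifications.

```python
def conv1d_same(signal, kernel):
    """Dense 1D cross-correlation with zero padding; assumes an odd kernel length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def masked_conv1d(signal, kernel, visible):
    """Zero masked inputs, convolve densely, then zero masked outputs again."""
    masked_in = [x if v else 0.0 for x, v in zip(signal, visible)]
    dense_out = conv1d_same(masked_in, kernel)
    return [y if v else 0.0 for y, v in zip(dense_out, visible)]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
visible = [True, True, False, False, True]  # positions 2 and 3 are masked
kernel = [1.0, 1.0, 1.0]

# Dense convolution on the zeroed input still writes into masked positions.
dense = conv1d_same([x if v else 0.0 for x, v in zip(signal, visible)], kernel)
# Re-masking the output keeps masked positions empty.
sparse = masked_conv1d(signal, kernel, visible)
```

Here `dense` is `[3.0, 3.0, 2.0, 5.0, 5.0]`, showing that neighbouring visible values spill into the masked positions, while `sparse` is `[3.0, 3.0, 0.0, 0.0, 5.0]`. Without the second masking step, stacked layers would gradually fill in the masked region and erode the mask pattern.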