Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture

Dong-Hee Kim, Sungduk Cho, Hyeonwoo Cho, Chanmin Park, Jinyoung Kim, Won Hwa Kim

TL;DR

Mask-JEPA introduces a self-supervised pretraining framework for Mask Classification Architectures (MCA) by integrating Joint Embedding Predictive Architecture (JEPA) with mask classification. It treats the pixel decoder and backbone as the JEPA encoder while using the transformer decoder as the predictor, augmented by Gaussian noise denoising and masked feature reconstruction losses to learn rich semantic and edge-aware representations. The method yields consistent improvements on ADE20K, Cityscapes, and COCO across semantic, instance, and panoptic segmentation, and proves effective in low-data regimes while remaining architecture-agnostic. Ablations and extended analyses demonstrate the importance of reconstruction, denoising, and auxiliary self-attention, and show that pretraining transfers across Mask2Former-based models, enabling robust, scalable self-supervised pretraining for universal image segmentation.

Abstract

In this work, we introduce Mask-JEPA, a self-supervised learning framework tailored for mask classification architectures (MCA), to overcome the traditional constraints associated with training segmentation models. Mask-JEPA combines a Joint Embedding Predictive Architecture with MCA to adeptly capture intricate semantics and precise object boundaries. Our approach addresses two critical challenges in self-supervised learning: 1) extracting comprehensive representations for universal image segmentation from a pixel decoder, and 2) effectively training the transformer decoder. The use of the transformer decoder as a predictor within the JEPA framework allows proficient training in universal image segmentation tasks. Through rigorous evaluations on datasets such as ADE20K, Cityscapes and COCO, Mask-JEPA demonstrates not only competitive results but also exceptional adaptability and robustness across various training scenarios. The architecture-agnostic nature of Mask-JEPA further underscores its versatility, allowing seamless adaptation to various members of the mask classification family.
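
As a rough illustration of how the two training signals named in the TL;DR (masked feature prediction and Gaussian noise denoising) could be combined into one objective, the sketch below adds a feature-prediction loss over masked positions to an image reconstruction loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the squared-error form of both terms, the weight `recon_weight`, and every function name are hypothetical, and features are plain Python lists rather than tensors.

```python
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def masked_prediction_loss(
    predicted: List[List[float]],
    target: List[List[float]],
    masked_idx: List[int],
) -> float:
    """Average MSE over the masked positions only.

    `predicted` stands in for the predictor (transformer decoder) output and
    `target` for the target encoder (backbone + pixel decoder) features.
    """
    if not masked_idx:
        return 0.0
    per_position = [mse(predicted[i], target[i]) for i in masked_idx]
    return sum(per_position) / len(per_position)


def total_loss(
    predicted: List[List[float]],
    target: List[List[float]],
    masked_idx: List[int],
    reconstructed_image: Sequence[float],
    clean_image: Sequence[float],
    recon_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Prediction loss plus weighted denoising reconstruction loss (assumed form)."""
    pred_loss = masked_prediction_loss(predicted, target, masked_idx)
    recon_loss = mse(reconstructed_image, clean_image)
    return pred_loss + recon_weight * recon_loss
```

In this toy form, only masked positions contribute to the prediction term, while the reconstruction term compares the image predicted from the noisy input against the clean image.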

Paper Structure (27 sections, 8 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Mask-JEPA improves universal image segmentation performance for Mask2Former with both ResNet-50 and Swin Transformer Tiny backbones, as evidenced by higher PQ, AP, and mIoU scores.
  • Figure 2: Visualization of $\mathcal{F}_\text{mask}$ from a Mask2Former trained with segmentation labels. Such a model not only accurately masks each object (blue box) but also sharply captures edges (red box). Mask-JEPA is designed to mimic these behaviors without segmentation labels.
  • Figure 3: Mask-JEPA overview. Mask-JEPA consists of an online mask classifier and a target backbone with a pixel decoder; the target networks are updated as an exponential moving average of their online counterparts. The target model processes image $x$, while the online model handles $x'$, which contains Gaussian noise $\epsilon$. The online transformer decoder processes features $\mathcal{F}_{i_1}$ from the pixel decoder along with random queries. In this step, features $\mathcal{F}_{i_1}$ are masked, with the masked positions replaced by mask tokens. The decoder's output predicts the features of the target pixel decoder. Finally, feature $\mathcal{F}_{i_\text{last}}$ from the pixel decoder is passed through a $1 \times 1$ convolution to predict the original image $x$.
  • Figure 4: Qualitative results. Our Mask-JEPA pretrained model achieves more accurate detection and segmentation than plain Mask2Former training (as shown within the white boxes). Both models are trained with a Swin-T backbone. Zoom in for a closer view.
  • Figure 5: Visualization of the Mask-JEPA pretrained pixel decoder output. We visualize the output $\mathcal{F}_{i_\text{last}}$ of the pixel decoder using k-means clustering. The results show that models trained with Mask-JEPA (row 3) effectively identify both semantic objects and edges. In contrast, a pixel decoder without pretraining (row 2) struggles to cluster similar semantics; e.g., in column 2, it completely fails to detect automobiles.
  • ...and 3 more figures
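
The three data-flow steps named in the Figure 3 caption (the exponential moving average update of the target networks, the Gaussian noise added to the online input, and the replacement of features with mask tokens) can be sketched as below. This is a toy sketch under stated assumptions rather than the paper's code: parameters are scalars in a dictionary instead of tensors, the helper names are hypothetical, and the momentum, noise scale, and mask ratio are illustrative values that the summary does not specify.

```python
import random
from typing import Dict, List, Tuple


def ema_update(target: Dict[str, float], online: Dict[str, float],
               momentum: float = 0.99) -> None:
    """In place: target <- momentum * target + (1 - momentum) * online."""
    for name, value in online.items():
        target[name] = momentum * target[name] + (1.0 - momentum) * value


def add_gaussian_noise(image: List[float], sigma: float,
                       rng: random.Random) -> List[float]:
    """Return x' = x + eps with eps ~ N(0, sigma^2) drawn per pixel."""
    return [pixel + rng.gauss(0.0, sigma) for pixel in image]


def mask_features(features: List[List[float]], mask_token: List[float],
                  mask_ratio: float, rng: random.Random
                  ) -> Tuple[List[List[float]], List[int]]:
    """Replace a random subset of feature vectors with a copy of the mask token.

    Returns the masked feature list and the sorted indices that were masked.
    """
    n = len(features)
    n_masked = int(round(mask_ratio * n))
    masked_idx = sorted(rng.sample(range(n), n_masked))
    masked_set = set(masked_idx)
    masked = [list(mask_token) if i in masked_set else list(f)
              for i, f in enumerate(features)]
    return masked, masked_idx


if __name__ == "__main__":
    rng = random.Random(0)
    target, online = {"w": 1.0}, {"w": 0.0}
    ema_update(target, online, momentum=0.9)  # target["w"] moves toward online["w"]
    noisy = add_gaussian_noise([0.2, 0.4, 0.6], sigma=0.1, rng=rng)
    feats, idx = mask_features([[1.0], [2.0], [3.0], [4.0]], [0.0], 0.5, rng)
    print(target, noisy, feats, idx)
```

Because the target networks only receive the averaged weights, they change more slowly than the online networks, which is the usual motivation for an EMA target in this family of methods.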