Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture
Dong-Hee Kim, Sungduk Cho, Hyeonwoo Cho, Chanmin Park, Jinyoung Kim, Won Hwa Kim
TL;DR
Mask-JEPA introduces a self-supervised pretraining framework for Mask Classification Architectures (MCA) by integrating Joint Embedding Predictive Architecture (JEPA) with mask classification. It treats the pixel decoder and backbone as the JEPA encoder while using the transformer decoder as the predictor, augmented by Gaussian noise denoising and masked feature reconstruction losses to learn rich semantic and edge-aware representations. The method yields consistent improvements on ADE20K, Cityscapes, and COCO across semantic, instance, and panoptic segmentation, and proves effective in low-data regimes while remaining architecture-agnostic. Ablations and extended analyses demonstrate the importance of reconstruction, denoising, and auxiliary self-attention, and show that pretraining transfers across Mask2Former-based models, enabling robust, scalable self-supervised pretraining for universal image segmentation.
Abstract
In this work, we introduce Mask-JEPA, a self-supervised learning framework tailored for mask classification architectures (MCA), to overcome the traditional constraints associated with training segmentation models. Mask-JEPA combines a Joint Embedding Predictive Architecture with MCA to capture intricate semantics and precise object boundaries. Our approach addresses two critical challenges in self-supervised learning: 1) extracting comprehensive representations for universal image segmentation from a pixel decoder, and 2) effectively training the transformer decoder. Using the transformer decoder as the predictor within the JEPA framework allows it to be trained effectively for universal image segmentation tasks. Through rigorous evaluations on datasets such as ADE20K, Cityscapes, and COCO, Mask-JEPA demonstrates not only competitive results but also exceptional adaptability and robustness across various training scenarios. The architecture-agnostic nature of Mask-JEPA further underscores its versatility, allowing seamless adaptation to various members of the mask classification family.
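To make the role assignment described above concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a clean target branch, a context branch that sees a noised and masked input, and a mean-squared error in embedding space; none of these details are stated in the abstract. The linear maps `enc_w`, `target_w`, and `pred_w` are hypothetical stand-ins for the backbone plus pixel decoder, a target encoder, and the transformer decoder, respectively.

```python
import random


def add_gaussian_noise(x, sigma, rng):
    """Perturb each feature with zero-mean Gaussian noise (denoising-style corruption)."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def mask_features(x, mask_ratio, rng):
    """Zero out a random subset of features (masked-reconstruction-style corruption)."""
    n_mask = int(len(x) * mask_ratio)
    masked_idx = set(rng.sample(range(len(x)), n_mask))
    return [0.0 if i in masked_idx else v for i, v in enumerate(x)]


def linear(x, w):
    """Apply a weight matrix given as a list of rows; a toy stand-in for a network."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def jepa_style_loss(x, enc_w, target_w, pred_w, sigma, mask_ratio, rng):
    """Toy JEPA-style objective (all modelling choices here are assumptions).

    enc_w    : hypothetical stand-in for backbone + pixel decoder (the encoder)
    target_w : hypothetical stand-in for a target encoder fed the clean input
    pred_w   : hypothetical stand-in for the transformer decoder (the predictor)
    """
    corrupted = mask_features(add_gaussian_noise(x, sigma, rng), mask_ratio, rng)
    context = linear(corrupted, enc_w)    # encode the corrupted view
    target = linear(x, target_w)          # encode the clean view
    predicted = linear(context, pred_w)   # predict the target embedding
    return mse(predicted, target)         # compare in embedding space, not pixel space


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    x = [1.0, 2.0, 3.0, 4.0]
    print(jepa_style_loss(x, identity, identity, identity, 0.1, 0.5, rng))
```

The point of the sketch is only that the prediction target lives in feature space: with no corruption and identity maps the loss is zero, and it grows as noise or masking removes information that the predictor must recover.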
