MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments

Spyros Gidaris, Andrei Bursuc, Oriane Simeoni, Antonin Vobecky, Nikos Komodakis, Matthieu Cord, Patrick Pérez

TL;DR

MOCA presents a novel self-supervised framework for Vision Transformers that unifies dense contextual reasoning with perturbation invariance by predicting mask-based online codebook assignments. It leverages a teacher-student EMA to generate high-level token assignment targets over an online codebook and optimizes two complementary losses, $L_{\text{LOC}}$ and $L_{\text{IMG}}$, balanced by a parameter to encourage both local and global consistency. Key innovations include simple online codebook construction with dynamic prototypes $W^d$ and $W^b$ generated by $G^d$ and $G^b$, condenser-based decoding to promote spatial structure, and partial decoding to improve training efficiency. Empirically, MOCA achieves state-of-the-art or competitive results in linear, k-NN, and low-shot settings on ImageNet, with faster training than prior methods, and strong performance on segmentation and detection benchmarks, highlighting its practical impact for efficient, versatile representation learning. Overall, MOCA advances self-supervised learning for ViTs by coupling high-level, mask-based predictions with online codebooks, enabling robust, transferable visual representations with reduced computational cost.
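The teacher-student EMA and the weighting of the two losses mentioned above can be illustrated with a minimal sketch. It is not taken from the MOCA code base: the function names `ema_update` and `total_loss`, the momentum value, and the convex-combination form of the weighting with parameter `lam` are assumptions made for illustration only.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter toward the student's by one EMA step.

    `momentum` is an illustrative value, not the schedule used by MOCA.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

def total_loss(loss_loc, loss_img, lam=0.5):
    """Combine the dense and image-wise losses.

    The convex combination with weight `lam` is an assumed form; the
    summary only states that a parameter balances the two losses.
    """
    return lam * loss_loc + (1.0 - lam) * loss_img

# Toy parameters: one weight vector per network.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, momentum=0.9)
combined = total_loss(2.0, 4.0, lam=0.25)
```

With a momentum close to 1 the teacher changes slowly, which is what makes its assignments usable as comparatively stable targets for the student.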

Abstract

Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. Doing so, we achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols with a training that is at least 3 times faster than prior methods. We provide the implementation code at https://github.com/valeoai/MOCA.

Paper Structure

This paper contains 48 sections, 6 equations, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Comparison of MOCA with state-of-the-art methods using ViT-B/16. (a) k-NN ImageNet classification accuracy vs. pre-training time; (b) one-, two-, five-, and $\approx$13-shot (the last corresponding to $1\%$ of the training data) ImageNet classification accuracy; (c) semantic segmentation on Cityscapes using linear probing and (d) fine-tuning with $100$, $372$, and $2975$ training images. MOCA achieves superior results while requiring 3 times less training time. '$\dagger$' denotes the use of multiple crops (Caron et al., 2021).
  • Figure 2: Overview of MOCA. The teacher (bottom) takes as input two unmasked random views $\mathbf{x}^{\{1,2\}}$ of the same image and generates dense token-wise code assignments $q_{\mathrm{T}}(\mathbf{x})$ for them (i.e., it soft-assigns codebook items to the patch tokens). The student (top) receives as input a randomly masked version $\tilde{\mathbf{x}}^1$ of view $\mathbf{x}^1$ and is trained to minimize two types of self-supervised losses: (1) A masked same-view token assignment prediction loss $L_{\text{LOC}}$, which requires predicting the teacher-produced assignment vectors of view $\mathbf{x}^1$ from the corresponding masked image $\tilde{\mathbf{x}}^{1}$. This is a spatially dense loss that enables learning representations with dense contextual reasoning. (2) A masked cross-view average assignment prediction loss $L_{\text{IMG}}$, which requires predicting, from the global image embedding of the masked first view $\tilde{\mathbf{x}}^{1}$, the average assignment vector of the opposite view $\mathbf{x}^2$ ('GAP' stands for Global Average Pooling). This is an image-wise loss that promotes learning image representations that are invariant to different augmentations of the input. The same objectives are applied symmetrically when the student gets as input the masked version $\tilde{\mathbf{x}}^2$ of view $\mathbf{x}^2$ (not shown). We implement the $L_{\text{LOC}}$ objective with a condenser-based decoder that gets as input patch-token embeddings from an intermediate layer of the student encoder and the global image embedding from its last layer (see the corresponding section of the paper). The teacher is an exponential moving average ('EMA') of the student.
  • Figure 3: COCO detection and instance segmentation with ViT-B/16. '$\dagger$' denotes the use of multiple crops (Caron et al., 2021).
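The two objectives described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the authors' implementation: the cosine-similarity soft assignment, the `temperature` value, the cross-entropy form of both losses, averaging $L_{\text{LOC}}$ over all token positions, and every function name are assumptions. Random arrays stand in for the teacher features and the student predictions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_assign(tokens, codebook, temperature=0.1):
    """Soft-assign each token to codebook items (assumed: cosine similarity)."""
    t = tokens / np.linalg.norm(tokens, axis=-1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    return softmax(t @ c.T / temperature)

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between distributions along the last axis."""
    return -(target * np.log(pred + eps)).sum(axis=-1)

def loc_loss(q_teacher_same_view, p_student_tokens):
    """Dense loss: per-token prediction of the same view's teacher assignments."""
    return cross_entropy(q_teacher_same_view, p_student_tokens).mean()

def img_loss(q_teacher_other_view, p_student_global):
    """Image-wise loss: predict the other view's average assignment (GAP)."""
    target = q_teacher_other_view.mean(axis=0)  # global average pooling
    return cross_entropy(target, p_student_global)

rng = np.random.default_rng(0)
num_tokens, dim, num_codes = 16, 8, 32
codebook = rng.normal(size=(num_codes, dim))

# Teacher token features for the two unmasked views (random stand-ins).
q1 = soft_assign(rng.normal(size=(num_tokens, dim)), codebook)
q2 = soft_assign(rng.normal(size=(num_tokens, dim)), codebook)

# Student predictions from the masked first view (random stand-ins).
p_tokens = softmax(rng.normal(size=(num_tokens, num_codes)))
p_global = softmax(rng.normal(size=(num_codes,)))

l_loc = loc_loss(q1, p_tokens)
l_img = img_loss(q2, p_global)
```

The symmetric case swaps the roles of the views: the student sees the masked second view, `q2` becomes the dense target, and the average of `q1` becomes the image-wise target.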