MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments
Spyros Gidaris, Andrei Bursuc, Oriane Simeoni, Antonin Vobecky, Nikos Komodakis, Matthieu Cord, Patrick Pérez
TL;DR
MOCA presents a novel self-supervised framework for Vision Transformers that unifies dense contextual reasoning with perturbation invariance by predicting mask-based online codebook assignments. It leverages a teacher-student EMA to generate high-level token assignment targets over an online codebook and optimizes two complementary losses, $L_{\text{LOC}}$ and $L_{\text{IMG}}$, balanced by a parameter to encourage both local and global consistency. Key innovations include simple online codebook construction with dynamic prototypes $W^d$ and $W^b$ generated by $G^d$ and $G^b$, condenser-based decoding to promote spatial structure, and partial decoding to improve training efficiency. Empirically, MOCA achieves state-of-the-art or competitive results in linear, k-NN, and low-shot settings on ImageNet, with faster training than prior methods, and strong performance on segmentation and detection benchmarks, highlighting its practical impact for efficient, versatile representation learning. Overall, MOCA advances self-supervised learning for ViTs by coupling high-level, mask-based predictions with online codebooks, enabling robust, transferable visual representations with reduced computational cost.
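The ingredients named in this summary (an EMA teacher, soft token assignments over a codebook, and two losses balanced by a parameter) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the cosine-similarity scoring, the temperature, the momentum value, and the convex combination `lam * loss_loc + (1 - lam) * loss_img` are all assumptions made for illustration, and the codebook here is a fixed list rather than the dynamically generated prototypes $W^d$ and $W^b$.

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the student weight (exponential moving average)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def codebook_assignment(token, codebook, temperature=0.1):
    """Soft assignment of one teacher token over the codebook entries (assumed scoring rule)."""
    return softmax([cosine(token, w) for w in codebook], temperature)

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def combined_loss(loss_loc, loss_img, lam=0.5):
    """Assumed convex combination of the local (token-level) and image-level losses."""
    return lam * loss_loc + (1.0 - lam) * loss_img

# Toy example: a 2-entry codebook and one teacher token close to the first entry.
codebook = [[1.0, 0.0], [0.0, 1.0]]
teacher_token = [0.9, 0.1]
target = codebook_assignment(teacher_token, codebook)

# Hypothetical student logits for the same (masked) token position.
student_pred = softmax([2.0, 0.5])
loss_loc = cross_entropy(target, student_pred)
loss_img = 0.0  # placeholder; an image-level term would be computed analogously
total = combined_loss(loss_loc, loss_img, lam=0.5)
```

In this toy setting the teacher token is assigned mostly to the first codebook entry, and the student is penalized to the extent that its predicted distribution for the masked position disagrees with that assignment.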
Abstract
Self-supervised learning can be used to mitigate the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. In doing so, we achieve new state-of-the-art results in low-shot settings and strong experimental results in various evaluation protocols, with training that is at least 3 times faster than prior methods. We provide the implementation code at https://github.com/valeoai/MOCA.
