Unsupervised Transformer Pre-Training for Images: Self-Distillation, Mean Teachers, and Random Crops
Mattia Scardecchia
TL;DR
This work surveys unsupervised visual representation learning centered on the DINO family, detailing how self-distillation with a mean-teacher EMA, multi-crop augmentations, and transformer backbones yield robust, general-purpose features without labels. It traces the lineage from DINO to iBOT and the large-scale DINOv2, highlighting how architectural choices, normalization tricks, and data curation enable strong off-the-shelf transfer and dense-prediction performance. Key contributions include empirical comparisons across SSL and weakly supervised methods, qualitative analyses of attention and feature structure, and a synthesis of extensions that scale the approach. The findings indicate that, at scale, carefully designed self-supervised objectives can approach or surpass weak supervision on a wide range of vision tasks, with emergent capabilities such as explicit object-boundary awareness in ViT representations.
Abstract
Recent advances in self-supervised learning (SSL) have made it possible to learn general-purpose visual features that capture both the high-level semantics and the fine-grained spatial structure of images. Most notably, DINOv2 has established a new state of the art by surpassing weakly supervised learning (WSL) methods like OpenCLIP on most benchmarks. In this survey, we examine the core ideas behind its approach, namely multi-crop view augmentation and self-distillation with a mean teacher, and trace their development in previous work. We then compare the performance of DINO and DINOv2 with other SSL and WSL methods across various downstream tasks, and highlight some remarkable emergent properties of the features they learn with transformer backbones. We conclude by briefly discussing DINOv2's limitations, its impact, and future research directions.
