Unsupervised Transformer Pre-Training for Images: Self-Distillation, Mean Teachers, and Random Crops

Mattia Scardecchia

TL;DR

This work surveys unsupervised visual representation learning centered on the DINO family, detailing how self-distillation with a mean-teacher EMA, multi-crop augmentations, and transformer backbones yield robust, general-purpose features without labels. It traces the lineage from DINO to iBOT and the large-scale DINOv2, highlighting how architectural choices, normalization tricks, and data curation enable strong off-the-shelf transfer and dense-prediction performance. Key contributions include empirical comparisons across SSL and weakly supervised methods, qualitative analyses of attention and feature structure, and a synthesis of extensions that scale the approach. The findings demonstrate that scale and carefully designed self-supervised objectives can approach or surpass weak supervision on a wide range of vision tasks, with emergent capabilities such as explicit object-boundary awareness in ViT representations.

Abstract

Recent advances in self-supervised learning (SSL) have made it possible to learn general-purpose visual features that capture both the high-level semantics and the fine-grained spatial structure of images. Most notably, the recent DINOv2 has established a new state of the art by surpassing weakly supervised learning (WSL) methods like OpenCLIP on most benchmarks. In this survey, we examine the two core ideas behind its approach (multi-crop view augmentation and self-distillation with a mean teacher) and trace their development in previous work. We then compare the performance of DINO and DINOv2 with other SSL and WSL methods across various downstream tasks, and highlight some remarkable emergent properties of the features they learn with transformer backbones. We conclude by briefly discussing DINOv2's limitations, its impact, and future research directions.

Paper Structure

This paper contains 17 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: DINO algorithm without multi-crop. Two views of the same input are processed by student and teacher encoders, which share the same architecture but have different parameters. The teacher output is centered using batch statistics, then both outputs are normalized with a temperature softmax. Teacher weights are an EMA of the student's. Embedding similarity is measured with a cross-entropy loss. (Left) PyTorch pseudocode of DINO. (Right) Diagram of DINO. Figures from caronEmergingPropertiesSelfSupervised2021.
  • Figure 2: View Augmentation in discriminative SSL for images. (Left): Illustration of common stochastic data augmentation operations used for view augmentation. (Right): Random cropping generates semantically rich view correspondences, including adjacency and global-local relationships. Figures from chenSimpleFrameworkContrastive2020.
  • Figure 3: Self-attention patterns of a ViT trained with DINO. Visualization of the self-attention weights in the last layer of a ViT-S trained with DINO. (Left): Response to the query of the [CLS] token, with different heads encoded using different colors. Each head focuses on different objects or parts. (Right): Responses to the queries of several patch tokens. The network has learned to separate objects. Figures from caronEmergingPropertiesSelfSupervised2021.
  • Figure 4: Comparison of attention masks between DINO and SL. The response to the [CLS] token in the last self-attention layer of a ViT is considered. Different columns show different attention heads. Figures from caronEmergingPropertiesSelfSupervised2021.
  • Figure 5: Evolution of some metrics during training with DINO. (Left) Comparison of top-1 accuracy on ImageNet-1k (INet-1k) with the kNN protocol, using frozen teacher and student embeddings. (Right) Entropy of teacher embeddings and KL divergence between teacher and student embeddings when using only centering, only sharpening, or both for the teacher targets. Figures from caronEmergingPropertiesSelfSupervised2021.
  • ...and 2 more figures
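
The training step described in the Figure 1 caption (teacher centering, temperature softmax on both outputs, cross-entropy between them, EMA teacher update) can be sketched roughly as follows. This is a minimal illustration in plain Python on lists of floats, not the authors' PyTorch pseudocode. The function names, the temperatures (0.1 and 0.04), and the momentum values (0.996 and 0.9) are assumptions chosen for illustration, and only one direction of the loss between the two views is shown.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log of a temperature-scaled softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def softmax(logits, temperature):
    """Temperature-scaled softmax; a lower temperature gives a sharper output."""
    return [math.exp(lp) for lp in log_softmax(logits, temperature)]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher targets and the student prediction.

    The teacher output is first centered (the running center is subtracted),
    then sharpened with a low-temperature softmax. The temperatures are
    illustrative assumptions.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    targets = softmax(centered, teacher_temp)  # treated as constants (no gradient)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(targets, student_log_probs))

def ema_update(teacher_params, student_params, momentum=0.996):
    """Mean-teacher update: the teacher is an EMA of the student's weights."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def update_center(center, teacher_batch, momentum=0.9):
    """EMA of the per-dimension batch mean of the teacher outputs."""
    n = len(teacher_batch)
    batch_mean = [sum(col) / n for col in zip(*teacher_batch)]
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]

if __name__ == "__main__":
    center = [0.0, 0.0, 0.0]
    student_out = [0.2, 1.0, -0.5]   # student output for view 1 (made-up numbers)
    teacher_out = [0.1, 1.2, -0.4]   # teacher output for view 2 (made-up numbers)
    print("loss:", dino_loss(student_out, teacher_out, center))
    print("new center:", update_center(center, [teacher_out]))
    print("new teacher weights:", ema_update([1.0, 2.0], [0.0, 0.0]))
```

In a real training loop only the student would receive gradients from this loss, while the teacher weights and the center would be refreshed with the two EMA helpers after each step.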