FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero

TL;DR

This work addresses the compute barrier of self-supervised pretraining for vision models by introducing a frequency-based curriculum for DINOv2. Training proceeds in two stages: the model first trains on downsampled images, which retain mostly low-frequency content, for the initial portion of training, and then switches to full-resolution images with Gaussian noise patching to promote robustness. With a ViT-B/16 backbone on ImageNet-1K, pretraining is about 1.6x faster and uses 2.25x fewer FLOPs, while linear probing accuracy stays competitive and robustness on ImageNet-C matches the baseline. The findings suggest that deliberate data curricula and targeted augmentations can yield robust self-supervised models without extreme model scaling, and they leave room for adaptive scheduling and broader applicability.

Abstract

Large-scale vision foundation models such as DINOv2 achieve impressive performance by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, for example on private data, on new modalities, or simply for scientific inquiry, and doing so is currently extremely demanding in compute. We thus propose a novel pre-training strategy for DINOv2 that accelerates convergence and, as a by-product, strengthens robustness to common corruptions. Our approach combines a frequency filtering curriculum, in which low-frequency content is seen first, with the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, our method reduces pre-training time by 1.6x and FLOPs by 2.25x while still matching the baseline's robustness on corruption benchmarks (ImageNet-C) and maintaining competitive linear probing performance. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration of data curricula and augmentation as means to improve the robustness of self-supervised learning models. The code is available at https://github.com/KevinZ0217/fast_dinov2
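
The two-stage curriculum described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository linked above for that). The 75%/25% split follows the Figure 1 caption; everything else is an assumption made for illustration: the function names `downsample`, `gaussian_noise_patch`, and `curriculum_view`, average pooling as the downsampling operator, a single square noise patch per image, and the values of `factor`, `patch_size`, and `sigma`.

```python
import numpy as np

def downsample(image, factor):
    """Average-pool an HxWxC image by an integer factor.

    Assumption: average pooling stands in for whatever resizing the
    real pipeline uses; it discards high-frequency detail.
    """
    h, w, c = image.shape
    h2, w2 = h // factor, w // factor
    cropped = image[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))

def gaussian_noise_patch(image, patch_size, sigma, rng):
    """Add Gaussian noise inside one randomly placed square patch.

    Assumption: pixel values lie in [0, 1] and one patch is used.
    """
    h, w, c = image.shape
    top = rng.integers(0, h - patch_size + 1)
    left = rng.integers(0, w - patch_size + 1)
    out = image.copy()
    noise = rng.normal(0.0, sigma, size=(patch_size, patch_size, c))
    out[top:top + patch_size, left:left + patch_size] += noise
    return np.clip(out, 0.0, 1.0)

def curriculum_view(image, epoch, total_epochs, rng,
                    low_freq_fraction=0.75, factor=2,
                    patch_size=64, sigma=0.2):
    """Return the training input for a given epoch.

    Stage 1 (first `low_freq_fraction` of epochs): downsampled image.
    Stage 2 (remaining epochs): full resolution with a noise patch.
    """
    if epoch < int(low_freq_fraction * total_epochs):
        return downsample(image, factor)
    return gaussian_noise_patch(image, patch_size, sigma, rng)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
early = curriculum_view(image, epoch=10, total_epochs=100, rng=rng)
late = curriculum_view(image, epoch=90, total_epochs=100, rng=rng)
print(early.shape, late.shape)
```

Because stage 1 feeds smaller inputs, a ViT processes fewer tokens per image during most of training, which is the intuition behind the reported savings in time and FLOPs.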

Paper Structure

This paper contains 29 sections, 4 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: The FastDINOv2 training pipeline comprises two stages. In the first stage, the initial 75% of training epochs utilize only low-frequency features extracted via downsampling. In the second stage, the remaining 25% of epochs employ full-resolution images with Gaussian noise patching.
  • Figure 2: Fourier error sensitivity heatmaps of a model trained with our method and of the DINOv2 baseline. The heatmaps are generated on a subset of the ImageNet-100 validation set, with 5 images sampled from each class. Color indicates the error sensitivity to a specific frequency range. Low frequencies are concentrated in the center of the heatmap, while regions farther from the center correspond to high frequencies.
  • Figure 3: Grad-CAM maps for the DINOv2 baseline and FastDINOv2. The images in the first row are from the class "house finch, linnet, Carpodacus mexicanus", and those in the second row are from "goose". With the data curriculum, the model better captures the contour of the object. More examples appear in the appendix (Figure 4).
  • Figure 4: Additional Grad-CAM map examples for the DINOv2 baseline and FastDINOv2.
  • Figure 5: ImageNet-C examples for each corruption type.
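
One common way to build a Fourier error sensitivity heatmap like the one in Figure 2 is to perturb each image with a single Fourier basis pattern and record the error rate per frequency. The sketch below illustrates that general idea only; it is not claimed to be the authors' exact procedure. The names `fourier_basis` and `sensitivity_heatmap`, the `predict` callable, the perturbation strength `eps`, and the frequency range `max_freq` are all hypothetical, and images are assumed to be square with an even side length and pixel values in [0, 1].

```python
import numpy as np

def fourier_basis(size, u, v):
    """Real, unit-norm image whose centered spectrum is nonzero only at
    offset (u, v) from the center and at the mirrored offset (-u, -v).

    Assumption: `size` is even and |u|, |v| < size // 2.
    """
    spectrum = np.zeros((size, size), dtype=complex)
    c = size // 2
    spectrum[c + u, c + v] = 1.0
    spectrum[c - u, c - v] = 1.0  # mirror entry keeps the image real
    basis = np.fft.ifft2(np.fft.ifftshift(spectrum)).real
    return basis / np.linalg.norm(basis)

def sensitivity_heatmap(predict, images, labels, eps, max_freq):
    """Error rate of `predict` under each Fourier basis perturbation.

    `images` has shape (N, size, size, C); `predict` maps such a batch
    to N predicted labels. The center cell of the returned array is the
    lowest frequency, matching the layout described for Figure 2.
    """
    size = images.shape[1]
    side = 2 * max_freq + 1
    heat = np.zeros((side, side))
    for u in range(-max_freq, max_freq + 1):
        for v in range(-max_freq, max_freq + 1):
            pattern = fourier_basis(size, u, v)[None, :, :, None]
            perturbed = np.clip(images + eps * pattern, 0.0, 1.0)
            heat[u + max_freq, v + max_freq] = np.mean(
                predict(perturbed) != labels
            )
    return heat
```

In practice `predict` would wrap a linear classifier on top of the frozen backbone, and brighter cells far from the center would indicate sensitivity to high-frequency noise.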