FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

Jiaqi Zhang; Juntuo Wang; Zhixin Sun; John Zou; Randall Balestriero

TL;DR

This work lowers the compute barrier of self-supervised pretraining for vision models by introducing a frequency-based curriculum for DINOv2. Training proceeds in two stages: the model first sees downsampled images, which retain mostly low-frequency content, for the initial portion of training, then switches to full-resolution images augmented with Gaussian noise patching to promote robustness. On ImageNet-1K, pretraining is about 1.6x faster and uses 2.25x fewer FLOPs, while clean accuracy and robustness on ImageNet-C remain competitive; the results carry over to a ViT-B/16 backbone. The findings suggest that deliberate data curricula and targeted augmentations can yield robust self-supervised learning without extreme model scaling, and they leave room for adaptive scheduling and broader application.
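The two-stage schedule described above can be sketched as a small function that maps an epoch to a training configuration. This is a minimal illustration, not the authors' implementation: the names `StageConfig` and `curriculum_stage`, the `low_freq_fraction` parameter, and the default values (0.75, 112, 224) are assumptions chosen for the example. The summary only states that the low-frequency stage covers the initial portion of training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Settings for one training epoch (hypothetical container)."""
    resolution: int        # side length of the square input image
    noise_patching: bool   # whether Gaussian noise patching is applied


def curriculum_stage(epoch: int, total_epochs: int,
                     low_freq_fraction: float = 0.75,
                     low_res: int = 112, full_res: int = 224) -> StageConfig:
    """Return the settings for a zero-indexed epoch.

    All default values are illustrative assumptions, not values
    taken from the text above.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs)")
    if not 0.0 <= low_freq_fraction <= 1.0:
        raise ValueError("low_freq_fraction must lie in [0, 1]")

    # First epoch of the full-resolution stage.
    switch_epoch = int(total_epochs * low_freq_fraction)

    if epoch < switch_epoch:
        # Stage 1: downsampled images, which keep mostly low-frequency content.
        return StageConfig(resolution=low_res, noise_patching=False)
    # Stage 2: full-resolution images with Gaussian noise patching.
    return StageConfig(resolution=full_res, noise_patching=True)
```

A training loop would call `curriculum_stage(epoch, total_epochs)` once per epoch and rebuild its data pipeline when the returned configuration changes. Keeping the switch point as a single parameter makes it straightforward to experiment with other schedules.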

Abstract

Large-scale vision foundation models such as DINOv2 achieve impressive performance by leveraging massive architectures and training datasets. However, practitioners often need to reproduce these pre-training pipelines, for example on private data, on new modalities, or simply for scientific inquiry, and doing so is currently extremely demanding computationally. We therefore propose a novel pre-training strategy for DINOv2 that accelerates convergence and, as a by-product, strengthens robustness to common corruptions. Our approach combines a frequency filtering curriculum, in which low-frequency content is seen first, with the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, our method reduces pre-training time by 1.6x and FLOPs by 2.25x while matching the baseline's robustness on corruption benchmarks (ImageNet-C) and maintaining competitive linear probing performance. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, and it opens the door to further exploration of data curricula and augmentation as means of improving the robustness of self-supervised learning models. The code is available at https://github.com/KevinZ0217/fast_dinov2
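To make the augmentation named in the abstract more concrete, the sketch below adds Gaussian noise to one randomly placed square patch of an image and leaves the rest untouched. It shows one common formulation of patch-wise Gaussian noise and is not taken from the authors' code: the function name `gaussian_noise_patch`, the patch size, the noise scale `sigma`, and the assumption that pixel values lie in [0, 1] are all choices made for this example.

```python
import numpy as np


def gaussian_noise_patch(image, patch_size=64, sigma=0.2, rng=None):
    """Add Gaussian noise to one random square patch of an image.

    `image` is an H x W x C array with values assumed to lie in [0, 1].
    `patch_size` and `sigma` are illustrative defaults only.
    Returns a new float32 array; the input is not modified.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]

    # Keep the patch inside the image even when the image is small.
    size = min(patch_size, height, width)
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))

    out = image.astype(np.float32, copy=True)
    patch = out[top:top + size, left:left + size]

    # Perturb only the selected patch, then clip back to the valid range.
    noise = rng.normal(0.0, sigma, size=patch.shape)
    out[top:top + size, left:left + size] = np.clip(patch + noise, 0.0, 1.0)
    return out
```

Passing a seeded generator, for example `rng=np.random.default_rng(0)`, makes the patch location and noise reproducible, which helps when debugging a data pipeline.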
