Accessing Vision Foundation Models via ImageNet-1K

Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, Yun Fu

TL;DR

Proteus is a simple, data-efficient framework for distilling vision foundation models into smaller models on ImageNet-1K, without access to the teachers' original training data. By removing the components of conventional distillation that induce dataset bias and combining token-, patch-, and feature-level objectives, it preserves general-purpose representations while using only 1.2M proxy images. Across ViT-S and scaled ViT-B/L targets, Proteus achieves competitive ImageNet performance and strong generalization on dense-prediction tasks, often approaching Oracle results with far less data and compute. The approach generalizes across teachers (DINOv2, CLIP, SynCLR) and remains robust under data-diverse proxy configurations, enabling broader research access to foundation-model capabilities.

Abstract

Vision foundation models are renowned for their generalization ability, which stems from massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible (e.g., for CLIP and DINOv2), which poses great challenges to developing derivatives that could facilitate research. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, making the training of foundation models more accessible to the broader research community. When leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 19 benchmarks and outperforms other vision foundation models, including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B), and SynCLR-L/14 (600M), with a significantly smaller training set of 1.2M images.
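
To make the idea of combining three levels of objectives concrete, here is a minimal sketch in plain Python. The abstract only names the levels (token, patch, feature), so everything else is an assumption made for illustration: that each term is a mean squared error between student and teacher embeddings, that the token-level term uses the class token (index 0), that the feature-level term uses all tokens, that the patch-level term uses a chosen subset of patch tokens, and that the student embeddings already have the teacher's dimension. The function names and the `weights` parameter are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def three_level_loss(student_tokens, teacher_tokens, patch_indices,
                     weights=(1.0, 1.0, 1.0)):
    """Combine token-, feature-, and patch-level matching terms.

    student_tokens / teacher_tokens: lists of embedding vectors, where
    index 0 is assumed to be the class token and the rest are patch tokens.
    patch_indices: indices (>= 1) of the patch tokens used by the
    patch-level term (an assumed selection, e.g. a random subset).
    """
    # Token level (assumed): match only the class token.
    token_loss = mse(student_tokens[0], teacher_tokens[0])

    # Feature level (assumed): match every token, averaged over tokens.
    per_token = [mse(s, t) for s, t in zip(student_tokens, teacher_tokens)]
    feature_loss = sum(per_token) / len(per_token)

    # Patch level (assumed): match only the selected patch tokens.
    selected = [per_token[i] for i in patch_indices]
    patch_loss = sum(selected) / len(selected)

    w_token, w_feature, w_patch = weights
    total = w_token * token_loss + w_feature * feature_loss + w_patch * patch_loss
    return {"token": token_loss, "feature": feature_loss,
            "patch": patch_loss, "total": total}


# Toy example: one class token plus two patch tokens, 2-dimensional.
student = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
teacher = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
losses = three_level_loss(student, teacher, patch_indices=[2])
```

In this sketch no labels and no classification head appear anywhere; the student is trained only to reproduce the teacher's embeddings. That is one plausible way to avoid tying the student to ImageNet-1K's label set, though the abstract does not spell out which designs were removed.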

Paper Structure (32 sections, 8 equations, 16 figures, 14 tables)

Figures (16)

  • Figure 1: Distilling from DINOv2 with only 1.2M images, Proteus matches the Oracle models' performance across 19 benchmarks at different scales and our delivered ViT-Tiny model remarkably outperforms the traditional distillation method DeiT on all metrics.
  • Figure 2: Relationship between the number of training images and average accuracy in 12 fine-grained classification datasets. Distilling from DINOv2-g/14, Proteus-L/14 outperforms OpenCLIP-L/14* (2B) using 0.06% of its training data and matches the performance of the Oracle model DINOv2-L/14. The X-axis is on a logarithmic scale.
  • Figure 3: PCA visualization. DINOv2-B/14 is distilled from DINOv2-g on LVD-142M and Proteus-B/14 is distilled from DINOv2-L on ImageNet-1K. Details can be found in the visualization section.
  • Figure 4: Ablation study on combating dataset bias. ViT-S/14 models are distilled from DINOv2-B on ImageNet-1K for 300 epochs.
  • Figure 5: Ablation study on learning objectives. ViT-S/14 models are distilled from DINOv2-B on ImageNet-1K for 300 epochs.
  • ...and 11 more figures