Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

Raviteja Vemulapalli, Hadi Pouransari, Fartash Faghri, Sachin Mehta, Mehrdad Farajtabar, Mohammad Rastegari, Oncel Tuzel

TL;DR

This work addresses the problem of using powerful Vision Foundation Models (VFMs) when labeled data is limited and inference compute is constrained, by proposing task-oriented knowledge transfer. The method first adapts a VFM to the target task with a task-specific head, then distills its task-oriented knowledge into a small model using a large unlabeled transfer set, and finally finetunes the small model on the limited labeled data. Across five tasks, the proposed approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining, supervised ImageNet pretraining, and self-supervised DINO pretraining, while substantially reducing pretraining compute cost. A core insight is that the choice of transfer set strongly affects final performance: task-relevant transfer sets work best, and when such data is scarce, a transfer set curated with web-scale image retrieval is an effective substitute. This makes it practical to deploy small models in specialized domains.

Abstract

Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot be deployed for many real-world applications. Motivated by this, we ask the following important question, "How can we leverage the knowledge from a large VFM to train a small task-specific model for a new target task with limited labeled training data?", and propose a simple task-oriented knowledge transfer approach as a highly effective solution to this problem. Our experimental results on five target tasks show that the proposed approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining, supervised ImageNet pretraining, and self-supervised DINO pretraining by up to 11.6%, 22.1%, 13.7%, and 29.8%, respectively. Furthermore, the proposed approach also demonstrates up to 9x, 4x and 15x reduction in pretraining compute cost when compared to task-agnostic VFM distillation, ImageNet pretraining and DINO pretraining, respectively, while outperforming them. We also show that the dataset used for transferring knowledge has a significant effect on the final target task performance, and introduce a retrieval-augmented knowledge transfer strategy that uses web-scale image retrieval to curate effective transfer sets.
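The distillation stage described above, in which the small model is pretrained to match the finetuned VFM's target task predictions on unlabeled images, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the use of KL divergence as the matching loss, the `temperature` parameter, and all function names are assumptions, since this summary does not state the exact objective.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw class scores into a probability distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over a batch of unlabeled transfer images.

    teacher_logits: predictions of the VFM already finetuned on the target task.
    student_logits: predictions of the small target model on the same images.
    The KL objective is an assumed choice, not confirmed by the source.
    """
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


# Hypothetical logits for two unlabeled images and three target classes.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 1.0, 0.0], [0.0, 0.5, 0.5]]
loss = distillation_loss(teacher, student)
```

No labels appear in this loss, which is why the transfer set can be unlabeled; labels are needed only when finetuning the VFM beforehand and the small model afterwards.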

Paper Structure

This paper contains 31 sections, 12 figures, and 13 tables.

Figures (12)

  • Figure 1: Downstream task performance (EuroSAT classification) of the FastViT target model with different pretraining approaches. Here, we finetune the pretrained models using 10 labeled training images per class. Task-oriented knowledge transfer from the DINOv2 VFM with generic web data (CC3M) outperforms the popular ImageNet, CLIP and DINO pretraining approaches. Knowledge transfer with a transfer set curated using image retrieval performs significantly better than knowledge transfer with generic web data.
  • Figure 2: Top-left: Proposed task-oriented knowledge transfer approach that (a) first teaches the target task to the VFM using labeled target task data, (b) then uses this VFM to pretrain the target model by matching their target task predictions on an unlabeled transfer dataset, and (c) finally finetunes the target model using labeled target task data. Bottom-left: Alternative task-agnostic knowledge transfer approach that (d) first pretrains the target model by matching its features to the features extracted by the VFM on an unlabeled transfer dataset, and (e) then finetunes it using labeled target task data. Right: Transfer set curation using query-balanced image crop retrieval with a small target task dataset as the query set and a web-scale gallery set. By retrieving an equal number of samples for each query, this approach increases the diversity of the retrieved samples. We perform crop-level retrieval to increase the chances of finding good matches.
  • Figure 3: (a) Comparison of various approaches for different (VFM, transfer set) combinations with FastViT as the target image encoder. (b) Performance improvement when unlabeled target task data is used instead of the generic CC3M dataset for task-oriented knowledge transfer. The target tasks are HAM10K classification, EuroSAT classification, Places365 classification, ImageNet classification and ADE20K segmentation from top to bottom. Task-oriented knowledge transfer from VFMs (red curves) clearly outperforms alternative training strategies. The performance of the finetuned VFMs used for knowledge transfer is also shown here for reference (black curves). When the target task is ImageNet classification, the blue curve corresponds to training from scratch instead of ImageNet pretraining.
  • Figure 4: Comparison of various approaches in terms of their pretraining compute. The left two figures correspond to Places365 classification (250 training images per class) and the right two figures correspond to EuroSAT classification (10 training images per class). Here, we use the DINOv2 VFM for knowledge transfer and FastViT as the target architecture. Each curve in this figure was obtained by evaluating intermediate checkpoints of one training run. CLIP pretraining is represented by a dashed green line.
  • Figure 5: Performance of task-oriented knowledge transfer using retrieval-augmented transfer sets of varying sizes. The number of labeled images used for finetuning and also as retrieval queries is 4800 for the ADE20K dataset and 100 for the EuroSAT dataset.
  • ...and 7 more figures
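
The query-balanced retrieval described in the Figure 2 caption can be illustrated with a small sketch. It assumes that query images and gallery crops have already been embedded as vectors and that cosine similarity is the matching score. The embedding model, the similarity measure, and the names `cosine` and `query_balanced_retrieval` are hypothetical and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_balanced_retrieval(queries, gallery, per_query):
    """Pick the top `per_query` gallery crops for every query.

    Giving each query the same retrieval budget keeps a few dominant
    queries from filling the whole transfer set, which is the diversity
    argument made in the caption. Crops retrieved by more than one
    query are kept only once. Returns gallery indices.
    """
    selected, seen = [], set()
    for query in queries:
        ranked = sorted(
            range(len(gallery)),
            key=lambda i: cosine(query, gallery[i]),
            reverse=True,
        )
        for idx in ranked[:per_query]:
            if idx not in seen:
                seen.add(idx)
                selected.append(idx)
    return selected


# Hypothetical 2-D embeddings: two labeled query images, five gallery crops.
queries = [[1.0, 0.0], [0.0, 1.0]]
gallery = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 0.9], [1.0, 1.0]]
transfer_set = query_balanced_retrieval(queries, gallery, per_query=2)
```

At web scale the exhaustive sort would be replaced by an approximate nearest-neighbor index; the brute-force ranking here only keeps the example self-contained.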