Transferring Knowledge from Large Foundation Models to Small Downstream Models

Shikai Qiu, Boran Han, Danielle C. Maddix, Shuai Zhang, Yuyang Wang, Andrew Gordon Wilson

TL;DR

Adaptive Feature Transfer (AFT) tackles the challenge of transferring knowledge from very large foundation models to small, cost-efficient downstream models by regularizing learning in feature space rather than through weight initialization or output distillation. It introduces a kernel-based objective that learns a diagonal feature-weighting map μ to align downstream features with frozen pre-trained features, effectively selecting task-relevant information from multiple sources and across architectures with minimal overhead. Empirically, AFT yields substantial improvements across vision, language, and multi-modal tasks, and notably translates improvements in pre-trained models into downstream gains even when the downstream model is over 50× smaller, while enabling combinations of complementary features. The work highlights the practical impact for deploying foundation-model knowledge at reduced computational cost and outlines future extensions to broaden transferred feature sets and cross-domain applicability.

Abstract

How do we transfer the relevant knowledge from ever larger foundation models into small, task-specific downstream models that can run at much lower costs? Standard transfer learning using pre-trained weights as the initialization transfers limited information and commits us to often massive pre-trained architectures. This procedure also precludes combining multiple pre-trained models that learn complementary information. To address these shortcomings, we introduce Adaptive Feature Transfer (AFT). Instead of transferring weights, AFT operates purely on features, thereby decoupling the choice of the pre-trained model from the smaller downstream model. Rather than indiscriminately compressing all pre-trained features, AFT adaptively transfers pre-trained features that are most useful for performing the downstream task, using a simple regularization that adds minimal overhead. Across multiple vision, language, and multi-modal datasets, AFT achieves significantly better downstream performance compared to alternatives with a similar computational cost. Furthermore, AFT reliably translates improvement in pre-trained models into improvement in downstream performance, even if the downstream model is over $50\times$ smaller, and can effectively transfer complementary information learned by multiple pre-trained models.
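
The TL;DR describes the regularizer as a kernel-based objective with a learned diagonal weighting over pre-trained features. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a linear kernel, cosine-style normalization of the Gram matrices, and a root-mean-square gap as the penalty, none of which are specified in this section. The names `linear_gram`, `normalize_gram`, `kernel_alignment_penalty`, and `mu_diag` are hypothetical, and the feature values are made up.

```python
import math


def linear_gram(features):
    """Gram matrix of a linear kernel: K[i][j] = <f_i, f_j>."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def normalize_gram(gram):
    """Rescale so every example has unit self-similarity (cosine kernel)."""
    norms = [math.sqrt(gram[i][i]) for i in range(len(gram))]
    norms = [n if n > 0.0 else 1.0 for n in norms]  # guard all-zero rows
    return [[gram[i][j] / (norms[i] * norms[j]) for j in range(len(gram))]
            for i in range(len(gram))]


def kernel_alignment_penalty(downstream_feats, pretrained_feats, mu_diag):
    """Root-mean-square gap between the two normalized Gram matrices.

    mu_diag holds one weight per pre-trained feature dimension (the
    diagonal map, an assumed parametrization); a weight of 0 drops that
    dimension entirely.
    """
    weighted = [[m * v for m, v in zip(mu_diag, row)]
                for row in pretrained_feats]
    k_down = normalize_gram(linear_gram(downstream_feats))
    k_pre = normalize_gram(linear_gram(weighted))
    n = len(k_down)
    squared_gap = sum((k_down[i][j] - k_pre[i][j]) ** 2
                      for i in range(n) for j in range(n))
    return math.sqrt(squared_gap / (n * n))


# Toy batch of 3 examples (made-up numbers). The pre-trained features have
# 3 dimensions, the downstream features only 2; the kernels are batch-by-batch,
# so the two feature widths never need to match.
pretrained = [[1.0, 0.0, 3.0], [0.0, 1.0, -2.0], [1.0, 1.0, 5.0]]
downstream = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

penalty_selected = kernel_alignment_penalty(downstream, pretrained, [1.0, 1.0, 0.0])
penalty_all = kernel_alignment_penalty(downstream, pretrained, [1.0, 1.0, 1.0])
print(penalty_selected, penalty_all)
```

In this toy case the downstream features match the first two pre-trained dimensions, so masking out the third dimension drives the penalty to zero, while weighting all three dimensions equally leaves a nonzero gap. This sketch only evaluates the penalty for fixed weights; in a training setup the weights and the downstream model would presumably be optimized jointly with the task loss.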

Paper Structure

This paper contains 30 sections, 6 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Adaptive Feature Transfer (AFT) transfers knowledge from large foundation models into small downstream models, improving downstream performance with minimal cost. (a) AFT regularizes the downstream model to prioritize learning the task-relevant subset of pre-trained features ($\mathrm{blue} \cap \mathrm{red}$) over entirely new features ($\mathrm{red} \setminus \mathrm{blue}$). The blue region represents information in pre-trained features, the red region represents information in downstream features, and the area inside the square boundary represents all information in the raw, uncompressed input. (b) Over 6 vision datasets and 8 NLP datasets, AFT significantly outperforms standard transfer learning (STL), knowledge distillation (KD) (Hinton et al., 2015; Romero et al., 2014), including its more sophisticated variants relational knowledge distillation (RKD) (Park et al., 2019) and factor transfer (FT) (Kim et al., 2018), and B-Tuning (You et al., 2022). Error is normalized by STL error and averaged over datasets and downstream models, including ViT-S, MLP-Mixer-B, ResNet-50, BERT-S, and DistilBERT. Error bars show standard errors across models and datasets. (c) AFT is the most effective at translating improvements in pre-trained models to improvements in downstream performance. See the experiments section for details.
  • Figure 2: Evaluation on 6 vision datasets using ViT-S, MLP-Mixer-B, and ResNet-50 as downstream models. (a) AFT achieves the lowest normalized error, averaged across all 6 datasets, 3 downstream models, and 3 seeds when transferring from DINOv2 ViT-G/14. The error is normalized by the STL error before averaging. Error bars show standard errors of the aggregated performance. (b) Breakdown of unnormalized error for each downstream model and dataset. Error bars show standard errors across 3 seeds. (c, d) On CIFAR-100, AFT further improves from combining multiple pre-trained models.
  • Figure 3: CIFAR-100 downstream accuracy vs linear probe accuracy of pre-trained features, averaged across 3 downstream models. AFT most effectively translates improvements in pre-trained models to improvements in downstream performance. Marker size is proportional to the number of parameters in the pre-trained models, ranging from 87 million to 2.7 billion.
  • Figure 4: Evaluation on 8 language datasets using BERT Small and DistilBERT as downstream models. (a) AFT achieves the lowest normalized error, averaged across 6 datasets, 2 downstream models, and 3 seeds, when transferring from Flan-T5 Large. The error is normalized by the STL error before averaging. Error bars show standard errors of the aggregated performance. (b) Breakdown of unnormalized error for each downstream model and dataset. Error bars show standard errors across 3 seeds.
  • Figure 5: BoolQ downstream accuracy vs linear probe accuracy of pre-trained features, averaged across 2 downstream models. AFT most effectively translates improvements in pre-trained models to improvements in downstream performance. Marker size is proportional to the log of the number of parameters in the pre-trained models, ranging from 61 million to 14 billion.
  • ...and 2 more figures
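
The captions for Figures 1, 2, and 4 report error normalized by the STL error before averaging, with standard errors as error bars. A minimal sketch of that aggregation follows, assuming each (dataset, model) pair contributes one error value per method and that the standard error is the sample standard deviation divided by the square root of the number of pairs. The function name `normalized_error_summary` is hypothetical and the error values are made up, not results from the paper.

```python
import math
import statistics


def normalized_error_summary(method_errors, stl_errors):
    """Mean and standard error of per-setting error ratios (method / STL).

    Both arguments map a (dataset, model) key to an error rate in (0, 1].
    """
    ratios = [method_errors[key] / stl_errors[key] for key in method_errors]
    mean_ratio = statistics.mean(ratios)
    if len(ratios) > 1:
        std_error = statistics.stdev(ratios) / math.sqrt(len(ratios))
    else:
        std_error = 0.0  # a single setting carries no spread estimate
    return mean_ratio, std_error


# Made-up error rates for two settings, for illustration only.
stl = {("dataset_a", "model_1"): 0.2, ("dataset_b", "model_1"): 0.4}
method = {("dataset_a", "model_1"): 0.1, ("dataset_b", "model_1"): 0.3}

mean_ratio, std_error = normalized_error_summary(method, stl)
print(mean_ratio, std_error)
```

Normalizing before averaging keeps a hard dataset with large absolute errors from dominating the aggregate: a value below 1 means the method beats STL on average across settings.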