One-for-All Model Initialization with Frequency-Domain Knowledge

Jianlu Shen, Fu Feng, Yucheng Xie, Jiaqi Lv, Xin Geng

TL;DR

It is empirically demonstrated that a model's foundational, task-agnostic knowledge, its "learngene", is encoded within the low-frequency components of its weights and can be efficiently inherited by downstream models.

Abstract

Transferring knowledge by fine-tuning large-scale pre-trained networks has become a standard paradigm for downstream tasks, yet the knowledge of a pre-trained model is tightly coupled to its monolithic architecture, which restricts flexible reuse across models of varying scales. In response to this challenge, recent approaches typically resort to either parameter selection, which fails to capture the interdependent structure of this knowledge, or parameter prediction using generative models, which depends on impractical access to large network collections. In this paper, we empirically demonstrate that a model's foundational, task-agnostic knowledge, its "learngene", is encoded within the low-frequency components of its weights and can be efficiently inherited by downstream models. Based on this insight, we propose FRONT (FRequency dOmain kNowledge Transfer), a novel framework that uses the Discrete Cosine Transform (DCT) to isolate the low-frequency "learngene". This learngene can be seamlessly adapted to initialize models of arbitrary size via simple truncation or padding, a process that is entirely training-free. For enhanced performance, we propose an optional low-cost refinement process that introduces a spectral regularizer to further improve the learngene's transferability. Extensive experiments demonstrate that FRONT achieves state-of-the-art performance, accelerates convergence by up to 15 times in vision tasks, and reduces training FLOPs by an average of 40.5% in language tasks.
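
The DCT, truncation or padding, and inverse DCT steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`dct_matrix`, `extract_learngene`, `init_from_learngene`), the choice of an orthonormal DCT-II, the treatment of each weight as a single 2D matrix, and the number of coefficients kept are all assumptions made for illustration. The sketch also applies no amplitude rescaling when the target size differs from the source size, which the paper may handle differently.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C of shape (n, n), so that C @ C.T == I."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return c


def dct2(w: np.ndarray) -> np.ndarray:
    """2D DCT of a weight matrix: transform rows and columns."""
    return dct_matrix(w.shape[0]) @ w @ dct_matrix(w.shape[1]).T


def idct2(f: np.ndarray) -> np.ndarray:
    """Inverse of dct2 (the DCT matrices are orthonormal)."""
    return dct_matrix(f.shape[0]).T @ f @ dct_matrix(f.shape[1])


def extract_learngene(w: np.ndarray, keep_rows: int, keep_cols: int) -> np.ndarray:
    """Keep only the top-left (low-frequency) block of DCT coefficients."""
    return dct2(w)[:keep_rows, :keep_cols].copy()


def init_from_learngene(gene: np.ndarray, out_rows: int, out_cols: int) -> np.ndarray:
    """Zero-pad or truncate the coefficients to the target shape, then invert."""
    coeffs = np.zeros((out_rows, out_cols))
    r = min(gene.shape[0], out_rows)
    c = min(gene.shape[1], out_cols)
    coeffs[:r, :c] = gene[:r, :c]
    return idct2(coeffs)


# Hypothetical usage: one source matrix initializes targets of two sizes.
rng = np.random.default_rng(0)
w_src = rng.standard_normal((16, 16))
gene = extract_learngene(w_src, 8, 8)
w_small = init_from_learngene(gene, 12, 12)  # coefficients zero-padded
w_large = init_from_learngene(gene, 24, 24)  # coefficients zero-padded further
w_tiny = init_from_learngene(gene, 4, 4)     # coefficients truncated
print(w_small.shape, w_large.shape, w_tiny.shape)
```

No gradient step is involved anywhere in the sketch, which matches the abstract's description of the adaptation as training-free.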

Paper Structure (57 sections, 22 equations, 12 figures, 15 tables)

This paper contains 57 sections, 22 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: (a) Five models of different scales, initialized from the same DeiT, are fine-tuned on the same downstream tasks. (b) A pre-trained DeiT is fine-tuned on five different downstream tasks. We plot the cosine similarity between the low-frequency components of the fine-tuned weights and their original pre-trained state over the training process. Each dashed line represents an individual fine-tuned model, while the solid line indicates their average.
  • Figure 2: Overview of FRONT and FRONT+. The learngenes are obtained either by FRONT, from openly available pre-trained models, or by FRONT+, from source models refined with our frequency regularization. The core process transforms weight modules via the DCT to extract task-agnostic knowledge. Finally, these learngenes are used to initialize diverse target models through the IDCT, which flexibly accommodates varying architectures via zero-padding or truncation.
  • Figure 3: Comparison of different direct initialization methods during full training on DeiT-Ti and DeiT-S.
  • Figure 4: Results of pretraining BERT-S, RoBERTa-S and GPT-S. FRONT achieves the highest FLOPs savings relative to the models trained from scratch: 47.5% for BERT-S, 42.9% for RoBERTa-S, and 31.0% for GPT-S.
  • Figure 5: Achieving Superior Performance with Efficient Refinement.
  • ...and 7 more figures
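
The measurement described in the Figure 1 caption, cosine similarity between the low-frequency components of fine-tuned and pre-trained weights, can be sketched as follows. This is a hedged illustration only: the helper names (`low_freq`, `cosine_similarity`), the orthonormal DCT-II, and the cutoff `k` that defines "low-frequency" are assumptions, and the random matrices stand in for real DeiT weights.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c


def low_freq(w: np.ndarray, k: int) -> np.ndarray:
    """Top-left k x k block of the 2D DCT coefficients of w."""
    coeffs = dct_matrix(w.shape[0]) @ w @ dct_matrix(w.shape[1]).T
    return coeffs[:k, :k]


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two arrays, compared as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-in weights: the "fine-tuned" matrix is the pre-trained one plus a
# small random perturbation (an assumption used only to exercise the code).
rng = np.random.default_rng(0)
w_pre = rng.standard_normal((32, 32))
w_ft = w_pre + 0.1 * rng.standard_normal((32, 32))

sim = cosine_similarity(low_freq(w_pre, 8), low_freq(w_ft, 8))
print(f"low-frequency cosine similarity: {sim:.3f}")
```

Tracking this value at successive checkpoints would produce one curve of the kind the caption describes.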