WAVE: Weight Templates for Adaptive Initialization of Variable-sized Models

Fu Feng, Yucheng Xie, Jing Wang, Xin Geng

TL;DR

WAVE reframes variable-sized model initialization as a multi-task problem by learning universal weight templates (learngenes) and small, size-specific weight scalers. Knowledge from pre-trained ancestry models is integrated into these templates via distillation with a Kronecker-product reconstruction, enabling consistent initialization across depth and width variations with only a few thousand trainable scaler parameters. Empirical results show state-of-the-art initialization performance, substantial compute savings (e.g., outperforming 150 epochs of direct pre-training with only 10 epochs of training after initialization), and strong transferability to diverse downstream datasets, with the weight templates exhibiting structured, size-agnostic knowledge. The approach is robust across deeper networks and different template configurations, highlighting its practical value for scalable deployment of variable-sized Vision Transformers.

Abstract

The growing complexity of model parameters underscores the significance of pre-trained models. However, deployment constraints often necessitate models of varying sizes, exposing limitations in the conventional pre-training and fine-tuning paradigm, particularly when target model sizes are incompatible with pre-trained ones. To address this challenge, we propose WAVE, a novel approach that reformulates variable-sized model initialization from a multi-task perspective, where initializing each model size is treated as a distinct task. WAVE employs shared, size-agnostic weight templates alongside size-specific weight scalers to achieve consistent initialization across various model sizes. These weight templates, constructed within the Learngene framework, integrate knowledge from pre-trained models through a distillation process constrained by Kronecker-based rules. Target models are then initialized by concatenating and weighting these templates, with adaptive connection rules established by lightweight weight scalers, whose parameters are learned from minimal training data. Extensive experiments demonstrate the efficiency of WAVE, achieving state-of-the-art performance in initializing models of various depths and widths. The knowledge encapsulated in weight templates is also task-agnostic, allowing for seamless transfer across diverse downstream datasets. Code will be made available at https://github.com/fu-feng/WAVE.
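For intuition, here is a minimal, hypothetical sketch (not the authors' released code) of the reconstruction idea described above: a target layer's weight is assembled from shared, size-agnostic templates and a handful of per-layer scalers via Kronecker products. The shapes, the sum-of-Kronecker form, and the name `build_layer_weight` are illustrative assumptions.

```python
import torch

# Illustrative shapes (assumptions, not the paper's exact configuration):
# a few small, size-agnostic weight templates are shared across all target sizes,
# while each target layer owns only a handful of trainable scaler entries.
num_templates, t_rows, t_cols = 4, 96, 96
s_rows, s_cols = 4, 4

templates = torch.randn(num_templates, t_rows, t_cols)                    # frozen after distillation
scalers = torch.randn(num_templates, s_rows, s_cols, requires_grad=True)  # few trainable parameters

def build_layer_weight(templates, scalers):
    """Assemble one target-layer weight as a sum of Kronecker products (assumed form)."""
    # torch.kron(A, B) has shape (A.rows * B.rows, A.cols * B.cols),
    # so small scalers tile the shared templates up to the target width.
    return sum(torch.kron(s, t) for s, t in zip(scalers, templates))

W = build_layer_weight(templates, scalers)
print(W.shape)  # torch.Size([384, 384]) with the toy shapes above
```

Because only the scalers are size-specific, initializing a wider or deeper target model amounts to choosing new scaler shapes while reusing the same frozen templates.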

Paper Structure

This paper contains 39 sections, 15 equations, 7 figures, 12 tables, and 1 algorithm.

Figures (7)

  • Figure 1: (a) Multi-task learning typically relies on a universal backbone with task-agnostic knowledge, complemented by a few trainable adapters for task-specific adaptation. (b) WAVE reformulates variable-sized model initialization as a multi-task problem by treating the initialization of each model size as a distinct task. By doing so, WAVE employs shared weight templates encapsulating size-agnostic knowledge, along with a few trainable weight scalers for size-specific initialization across various model sizes.
  • Figure 2: (a) The knowledge in pre-trained models is integrated into structured knowledge within weight templates through distillation, assisted by an auxiliary model constrained by the rules in Eq. \ref{equ:kro}, where $\otimes$ denotes the Kronecker product. (b) For initializing models of variable sizes, only the corresponding weight scalers are initialized according to the target model size. These scalers are trained with a small amount of data to learn the connection rules of the templates, while the weight templates remain frozen to retain the structured knowledge (a minimal sketch of this freeze-templates, train-scalers step follows the figure list).
  • Figure 3: Compared with Direct Pre-training. (a) Comparison of models initialized by WAVE and trained for 10 epochs versus those directly pre-trained for 150 epochs across 15 downstream models of varying sizes. (b) Analysis of computational cost as the number of initialized models increases. (c) Detailed training process (300 epochs) of models initialized by WAVE and direct pre-training.
  • Figure 4: Visualization of structured knowledge. (a) Knowledge in self-attention layers. (b) Relationships between layer position and corresponding parameter values after PCA.
  • Figure 5: Visualization of knowledge encapsulated in weight templates. All networks are used directly after initialization, without any additional training or fine-tuning.
  • ...and 2 more figures
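As a companion to the Figure 2 caption, the following is a minimal sketch, under the same assumptions as the earlier snippet, of stage (b): the distilled templates stay frozen while only the size-specific scalers of the target model are optimized on a small amount of data. The module name `WaveLayer`, the sum-of-Kronecker weight assembly, and the toy objective are illustrative, not the released implementation.

```python
import torch
from torch import nn

class WaveLayer(nn.Module):
    """One target layer whose weight is assembled from frozen shared templates
    and trainable, size-specific scalers (assumed sum-of-Kronecker form)."""
    def __init__(self, templates):
        super().__init__()
        self.register_buffer("templates", templates)          # frozen structured knowledge
        self.scalers = nn.Parameter(torch.randn(templates.shape[0], 4, 4) * 0.02)

    def forward(self, x):
        w = sum(torch.kron(s, t) for s, t in zip(self.scalers, self.templates))
        return x @ w.T

templates = torch.randn(4, 96, 96)                       # produced by the distillation stage (a)
layer = WaveLayer(templates)
optimizer = torch.optim.Adam([layer.scalers], lr=1e-3)   # only the scalers receive updates

x = torch.randn(8, 384)                                  # toy batch standing in for the small training set
loss = layer(x).pow(2).mean()                            # placeholder objective for the sketch
loss.backward()
optimizer.step()
```

In this sketch the templates are registered as buffers, so they carry no gradients; the per-layer scalers are the only trainable parameters, which is why the size-specific cost of initializing a new model remains on the order of a few thousand parameters.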