WAVE: Weight Templates for Adaptive Initialization of Variable-sized Models
Fu Feng, Yucheng Xie, Jing Wang, Xin Geng
TL;DR
WAVE reframes variable-sized model initialization as a multi-task problem by learning universal weight templates (learngenes) and small, size-specific weight scalers. Knowledge from pre-trained ancestry models is integrated into these templates via distillation with a Kronecker-product reconstruction, enabling consistent initialization across depth and width variations with only a few thousand trainable parameters for scalers. Empirical results show state-of-the-art initialization performance, substantial compute savings (e.g., outperforming 150-epoch direct pretraining with only 10 epochs after initialization), and strong transferability to diverse downstream datasets, with the weight templates exhibiting structured, size-agnostic knowledge. The approach demonstrates robustness across deeper networks and different template configurations, highlighting practical impact for scalable deployment of variable-sized Vision Transformers.
Abstract
The growing scale of model parameters underscores the significance of pre-trained models. However, deployment constraints often necessitate models of varying sizes, exposing limitations in the conventional pre-training and fine-tuning paradigm, particularly when target model sizes are incompatible with pre-trained ones. To address this challenge, we propose WAVE, a novel approach that reformulates variable-sized model initialization from a multi-task perspective, where initializing each model size is treated as a distinct task. WAVE employs shared, size-agnostic weight templates alongside size-specific weight scalers to achieve consistent initialization across various model sizes. These weight templates, constructed within the Learngene framework, integrate knowledge from pre-trained models through a distillation process constrained by Kronecker-based rules. Target models are then initialized by concatenating and weighting these templates, with adaptive connection rules established by lightweight weight scalers whose parameters are learned from minimal training data. Extensive experiments demonstrate the efficiency of WAVE, which achieves state-of-the-art performance in initializing models of varying depths and widths. The knowledge encapsulated in the weight templates is also task-agnostic, allowing seamless transfer across diverse downstream datasets. Code will be made available at https://github.com/fu-feng/WAVE.
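To make the template/scaler mechanism concrete, the following is a minimal sketch (not the authors' actual implementation; all names and shapes are illustrative) of how a layer's weight matrix could be assembled as a sum of Kronecker products between small, size-specific scalers and shared, size-agnostic templates:

```python
import numpy as np

def init_layer(templates, scalers):
    """Assemble one layer's weight from shared templates and per-size scalers.

    templates: list of (t1, t2) arrays shared across all target model sizes.
    scalers:   list of (m, n) arrays, one per template, specific to the target
               size; the resulting weight has shape (m * t1, n * t2).
    """
    # Each Kronecker product tiles a template according to its scaler,
    # so a few small scalers adapt fixed templates to any target width.
    return sum(np.kron(s, t) for s, t in zip(scalers, templates))

rng = np.random.default_rng(0)

# Two shared 4x4 templates; a target layer of width 192 then only needs
# 48x48 scalers (4 * 48 = 192), i.e. very few trainable parameters.
templates = [rng.standard_normal((4, 4)) for _ in range(2)]
scalers = [rng.standard_normal((48, 48)) * 0.01 for _ in range(2)]

W = init_layer(templates, scalers)
print(W.shape)  # (192, 192)
```

Under this reading, only the scalers are trained per target size, which is consistent with the few-thousand-parameter budget mentioned above; the templates themselves stay fixed and shared.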
