Exploring Learngene via Stage-wise Weight Sharing for Initializing Variable-sized Models

Shi-Yu Xia, Wenxuan Zhu, Xu Yang, Xin Geng

TL;DR

The paper tackles the challenge of initializing variable-sized Vision Transformer models under diverse resource constraints. It introduces Stage-wise Weight Sharing (SWS) to learn learngene layers via an Aux-Net trained with distillation from a large ancestry model, then expands these layers at corresponding stages to initialize Desc-Nets. SWS delivers consistent improvements over training-from-scratch and prior Learngene methods, while dramatically reducing training costs and parameter storage (e.g., around $6.6\times$ training cost reduction and $20\times$ fewer initialization parameters on ImageNet-1K). The approach highlights the importance of preserving stage information and providing explicit expansion guidance for scalable, resource-aware model deployment in real-world contexts.

Abstract

In practice, we usually need to build variable-sized models adapted to diverse resource constraints in different application scenarios, where weight initialization is an important step prior to training. The recently introduced Learngene framework first learns one compact part, termed learngene, from a large well-trained model, after which the learngene is expanded to initialize variable-sized models. In this paper, we start by analysing the importance of guidance for the expansion of well-trained learngene layers, which inspires the design of a simple but highly effective Learngene approach termed SWS (Stage-wise Weight Sharing), where both the learngene layers and their learning process critically contribute to providing knowledge and guidance for initializing models at varying scales. Specifically, to learn learngene layers, we build an auxiliary model comprising multiple stages where the layer weights in each stage are shared, after which we train it through distillation. Subsequently, we expand these learngene layers, which contain stage information, at their corresponding stages to initialize models of variable depths. Extensive experiments on ImageNet-1K demonstrate that SWS achieves consistently better performance than many models trained from scratch, while reducing total training costs by around 6.6x. In some cases, SWS performs better after only 1 epoch of tuning. When initializing variable-sized models adapted to different resource constraints, SWS achieves better results while reducing the parameters stored to initialize these models by around 20x and pre-training costs by around 10x, in contrast to the pre-training and fine-tuning approach.
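The two phases described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the helper names (`make_layer`, `build_aux_net`, `assign_layers`, `init_descendant`), the toy weight matrices, and in particular the even split of layers across stages are assumptions made for illustration, and the distillation training step is omitted entirely.

```python
import copy
import random


def make_layer(dim, rng):
    """Stand-in for one Transformer layer: a dim x dim weight matrix (hypothetical)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(dim)]


def build_aux_net(learngene, layers_per_stage):
    """Aux-Net view: every position inside a stage references the SAME layer object,
    so the weights within a stage are shared."""
    return [layer for layer in learngene for _ in range(layers_per_stage)]


def assign_layers(depth, num_stages):
    """Assumed assignment rule: split `depth` layers across stages as evenly as
    possible. The paper's actual strategy may differ."""
    base, extra = divmod(depth, num_stages)
    return [base + (1 if s < extra else 0) for s in range(num_stages)]


def init_descendant(learngene, depth):
    """Expand each learngene layer within its own stage. Layers are deep-copied so
    they can diverge from one another during fine-tuning."""
    counts = assign_layers(depth, len(learngene))
    return [copy.deepcopy(layer) for layer, n in zip(learngene, counts) for _ in range(n)]


rng = random.Random(0)
learngene = [make_layer(4, rng) for _ in range(3)]       # 3 stages, one learngene layer each
aux_net = build_aux_net(learngene, layers_per_stage=4)   # 12 positions, 3 unique layers
# (distillation from the large ancestry model would update `learngene` here)
descendants = {d: init_descendant(learngene, d) for d in (6, 9, 12)}
```

Only the three learngene layers need to be stored in this sketch, regardless of how many descendant depths are initialized from them, which is the intuition behind the storage savings the abstract reports.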

Paper Structure (14 sections, 4 equations, 5 figures, 6 tables)

This paper contains 14 sections, 4 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) Learngene, (b) Simple-LG, and (c) SWS, taking 3-layer learngenes as an example. (d) Visualization of validation loss values.
  • Figure 2: In the first phase, we build an auxiliary model comprising multiple stages. The layer weights in each stage are shared. Note that the number of layers in each stage and the number of stages are both configurable. Then we train it via distillation. After the learngene learning process, learngene layers containing stage information and expansion guidance are adopted to initialize descendant models of variable depths in the second phase. Finally, these models are fine-tuned normally and deployed to practical scenarios with diverse resource constraints.
  • Figure 3: Taking $M = 3$ as an example, we show (a) the layer assignment strategy, (b) the initialization strategy, and (c) the initialization order.
  • Figure 4: Performance comparisons on ImageNet-1K between several baselines and SWS. The number in brackets in (a)-(t) is the parameter count in millions, Params (M).
  • Figure 5: Performance comparisons on several downstream classification datasets of (a)-(d): Des-S-12 and (e)-(h): Des-B-12.