Learning to Grow Pretrained Models for Efficient Transformer Training

Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, Yoon Kim

TL;DR

The paper addresses the high cost of training ever-larger transformers by reusing knowledge from smaller pretrained models. It introduces LiGO, a learnable linear growth operator that initializes a larger model from a smaller one via a depth-width decomposition with Kronecker factorization, and connects this approach to Monarch matrices. Across language and vision domains, LiGO achieves substantial FLOPs and wall-time savings (up to about 55% in some cases) while preserving downstream performance, showing broad applicability and compatibility with other efficiency techniques. This work paves the way for scalable, data-driven pretraining of progressively larger transformers without training from scratch, with potential applicability to billion-parameter models.

Abstract

Scaling transformers has led to significant breakthroughs in many domains, leading to a paradigm in which larger versions of existing models are trained and released on a periodic basis. New instances of such models are typically trained completely from scratch, despite the fact that they are often just scaled-up versions of their smaller counterparts. How can we use the implicit knowledge in the parameters of smaller, extant models to enable faster training of newer, larger models? This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model. For tractable learning, we factorize the linear transformation as a composition of (linear) width- and depth-growth operators, and further employ a Kronecker factorization of these growth operators to encode architectural knowledge. Extensive experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch, while also consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models.
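The growth idea in the abstract can be sketched on a toy stack of square weight matrices. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `grow_width`, `grow_depth`, `ligo_like_init`, `A`, `B`, and `depth_weights` are hypothetical, the operator values are fixed by hand (in the paper they are learned), and the layers are plain matrices rather than transformer blocks.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def grow_width(W, B, A):
    """Width growth of one layer: W (d x d) becomes B W A^T (D x D)."""
    return matmul(matmul(B, W), transpose(A))


def grow_depth(layers, depth_weights):
    """Each new layer is a weighted sum of the (already widened) old layers."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [[sum(w * layer[i][j] for w, layer in zip(weights, layers))
          for j in range(cols)] for i in range(rows)]
        for weights in depth_weights
    ]


def ligo_like_init(small_layers, B, A, depth_weights):
    """Compose the two linear steps: widen every layer, then mix across depth."""
    widened = [grow_width(W, B, A) for W in small_layers]
    return grow_depth(widened, depth_weights)


random.seed(0)
d, num_small = 2, 2
small = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
         for _ in range(num_small)]

# Hypothetical operator values (assumption): keep the first d coordinates
# unchanged and fill the new coordinate with their average.
B = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
A = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Three new layers built from two old ones; the last is their average.
depth_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

large = ligo_like_init(small, B, A, depth_weights)
print(len(large), len(large[0]), len(large[0][0]))
```

Both steps are linear in the small model's parameters, so their composition is a single linear map, which is what makes it possible to treat the entries of `A`, `B`, and `depth_weights` as trainable quantities instead of fixing them by hand.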

Paper Structure (33 sections, 1 theorem, 10 equations, 8 figures, 6 tables, 1 algorithm)

Key Result

Proposition 1

StackBERT (Eq. eqn:depth_stack), Interpolation (Eq. eqn:depth_stack), and Net2Net (Eq. eqn:width_net2net) are all special cases of the LiGO operator (Eq. eqn:genral_form).
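To make the proposition concrete, the sketch below shows how such fixed growth rules could be written as particular, hand-set choices of a linear operator. It illustrates the general idea and is not the paper's proof: the function names are hypothetical, the depth rules assume the new depth is a multiple of the old one, and the width rule follows the usual Net2Net recipe of copying hidden neurons and dividing their outgoing weights by the number of copies.

```python
def stack_pattern(num_old, num_new):
    """Stacking-style depth weights: new layer i copies old layer i mod num_old."""
    return [[1.0 if j == i % num_old else 0.0 for j in range(num_old)]
            for i in range(num_new)]


def interpolation_pattern(num_old, num_new):
    """Interpolation-style depth weights: each old layer is repeated in place."""
    k = num_new // num_old  # assumes num_new is a multiple of num_old
    return [[1.0 if j == i // k else 0.0 for j in range(num_old)]
            for i in range(num_new)]


def net2net_width(W1, W2, mapping):
    """Widen the hidden layer between W1 and W2 by copying neurons.

    mapping[i] is the old hidden neuron that new hidden neuron i copies.
    Outgoing weights are divided by the copy count so the function is preserved.
    """
    counts = {}
    for src in mapping:
        counts[src] = counts.get(src, 0) + 1
    W1_new = [list(W1[src]) for src in mapping]
    W2_new = [[row[src] / counts[src] for src in mapping] for row in W2]
    return W1_new, W2_new


def relu_mlp(W1, W2, x):
    """Two-layer MLP without biases: W2 @ relu(W1 @ x)."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]


print(stack_pattern(2, 4))          # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(interpolation_pattern(2, 4))  # [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, -1.0]]
W1_wide, W2_wide = net2net_width(W1, W2, mapping=[0, 1, 0])
x = [2.0, 0.5]
print(relu_mlp(W1, W2, x), relu_mlp(W1_wide, W2_wide, x))
```

Each rule is a 0/1 (or rescaled 0/1) pattern applied linearly to the old weights. A learned operator can represent these patterns and also any other mixing coefficients, which is the sense in which the fixed rules are special cases.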

Figures (8)

  • Figure 1: Our linear growth operator (LiGO) accelerates training by using the weights of a smaller model $\boldsymbol{\Theta}$ to initialize the weights of the larger model $\boldsymbol{\Theta}^{(new)}$. LiGO is parameterized as a sparse linear map $\boldsymbol{M}$ that can be decomposed into width- and depth-expansion operators. The width-operator $\boldsymbol{R}_{width}$ and depth-operator $\boldsymbol{L}_{depth}$ are structured matrices obtained from Kronecker products of smaller matrices which encode architectural knowledge by grouping parameters into layers and neurons. While we show the expansion operators for simple multi-layer perceptrons for illustrative purposes, in practice we apply LiGO to enable faster training of transformer networks. In our approach, we learn the growth matrix $\boldsymbol{M}$ with 100 steps of SGD, use this to initialize the larger model, and then continue training as usual. Best viewed in color.
  • Figure 2: Results on BERT. (a-b) show validation log perplexity vs. FLOPs and wall time, respectively, for training BERT-Base by reusing BERT-Small. (c) shows log perplexity vs. FLOPs when growing BERT-Small and BERT-Base to BERT-Large. The solid line indicates the final perplexity of the larger model trained from scratch, while the dotted line represents the performance of the smaller model trained from scratch. LiGO offers about 45% savings in FLOPs and 40% savings in wall time over BERT-Base training from scratch. Our approach is also flexible in reusing either BERT-Small or BERT-Base for accelerating BERT-Large training.
  • Figure 3: Results on RoBERTa and GPT2. LiGO reduces FLOPs by $47.2\%$ and $22.5\%$ for RoBERTa-Base and GPT2-Medium, respectively, demonstrating its effectiveness across different training strategies and architectures.
  • Figure 4: Results on DeiT. (a) Accuracy vs. FLOPs and (b) accuracy vs. wall time for training DeiT-B. LiGO saves more than 50% of FLOPs and wall time over training from scratch on ImageNet.
  • Figure 5: LiGO with other efficient training strategies. Our approach can be combined with (a) layer dropping, (b) token dropping, and (c) staged training (ST) to further accelerate BERT training.
  • ...and 3 more figures
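
The Kronecker structure mentioned in the Figure 1 caption can be illustrated with a standard identity. The symbols $\boldsymbol{A}$, $\boldsymbol{B}$, $\boldsymbol{W}_j$, and $w_{l,j}$ below are assumptions introduced for illustration and do not appear in this summary. If a single layer $\boldsymbol{W}$ is widened by an output-side matrix $\boldsymbol{B}$ and an input-side matrix $\boldsymbol{A}$, then under column-major vectorization the same step is a Kronecker-structured linear map on the flattened weights:

$$
\operatorname{vec}\!\left(\boldsymbol{B}\,\boldsymbol{W}\,\boldsymbol{A}^{\top}\right) = \left(\boldsymbol{A} \otimes \boldsymbol{B}\right)\operatorname{vec}(\boldsymbol{W}).
$$

A depth step that builds each new layer as a weighted sum of widened old layers is likewise linear in the old parameters:

$$
\boldsymbol{W}^{(new)}_{l} = \sum_{j} w_{l,j}\,\boldsymbol{B}\,\boldsymbol{W}_{j}\,\boldsymbol{A}^{\top}.
$$

Under these assumptions, only the small factors $\boldsymbol{A}$, $\boldsymbol{B}$, and the scalars $w_{l,j}$ need to be stored or learned, rather than a dense map over all parameters.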

Theorems & Definitions (3)

  • Definition 1
  • Proposition 1
  • proof