Learning to Grow Pretrained Models for Efficient Transformer Training
Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, Yoon Kim
TL;DR
The paper addresses the high cost of training ever-larger transformers by reusing knowledge from smaller pretrained models. It introduces LiGO, a learnable linear growth operator that initializes a larger model from a smaller one. The operator is decomposed into depth- and width-growth components, each given a Kronecker factorization, and the paper connects this construction to Monarch matrices. Across language and vision domains, LiGO achieves substantial FLOPs and wall-time savings (up to about 55% in some cases) while preserving downstream performance, and it remains compatible with other efficiency techniques. The results suggest that progressively larger transformers can be initialized from existing smaller ones rather than trained from scratch, with potential applicability to billion-parameter models.
Abstract
Scaling transformers has led to significant breakthroughs in many domains, giving rise to a paradigm in which larger versions of existing models are trained and released on a periodic basis. New instances of such models are typically trained completely from scratch, despite the fact that they are often just scaled-up versions of their smaller counterparts. How can we use the implicit knowledge in the parameters of smaller, extant models to enable faster training of newer, larger models? This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model. For tractable learning, we factorize the linear transformation as a composition of (linear) width- and depth-growth operators, and further employ a Kronecker factorization of these growth operators to encode architectural knowledge. Extensive experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch, while consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models.
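The factorization described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the names `grow_width`, `grow_depth`, `A`, `B`, and `depth_weights` are hypothetical, the model is reduced to a stack of square weight matrices, and the growth parameters are random placeholders where LiGO would learn them. The sketch only shows that a width expansion followed by a depth-wise linear combination equals a single Kronecker-structured linear map on the flattened parameters.

```python
import numpy as np


def grow_width(W, A, B):
    """Widen one d1 x d1 weight matrix to d2 x d2 as B @ W @ A.T."""
    return B @ W @ A.T


def grow_depth(layers, depth_weights):
    """Build each of the L2 new layers as a linear combination of L1 layers.

    layers has shape (L1, d, d); depth_weights has shape (L2, L1).
    """
    return np.einsum("lj,jab->lab", depth_weights, layers)


def vec(M):
    """Column-major flattening, the convention under which
    vec(B W A^T) = (A kron B) vec(W) holds."""
    return M.reshape(-1, order="F")


rng = np.random.default_rng(0)
L1, L2, d1, d2 = 2, 4, 3, 5  # small depth/width -> large depth/width

# Stand-in for the small pretrained model: L1 layers of d1 x d1 weights.
small = rng.standard_normal((L1, d1, d1))

# Growth parameters (assumed random here; they would be learned in practice).
A = rng.standard_normal((d2, d1))
B = rng.standard_normal((d2, d1))
depth_weights = rng.standard_normal((L2, L1))

# Width growth per layer, then depth growth across layers.
widened = np.stack([grow_width(W, A, B) for W in small])
large = grow_depth(widened, depth_weights)  # shape (L2, d2, d2)

# Check 1: width growth of one layer is a Kronecker-structured linear map.
width_op = np.kron(A, B)  # shape (d2*d2, d1*d1)
kron_matches = np.allclose(vec(widened[0]), width_op @ vec(small[0]))

# Check 2: the full growth is one linear map on all stacked parameters.
full_op = np.kron(depth_weights, width_op)  # shape (L2*d2*d2, L1*d1*d1)
theta_small = np.concatenate([vec(W) for W in small])
theta_large = np.concatenate([vec(W) for W in large])
full_matches = np.allclose(theta_large, full_op @ theta_small)
```

In this toy setting the dense operator `full_op` has `L2*d2*d2 * L1*d1*d1` entries, whereas the factors `A`, `B`, and `depth_weights` together have only `2*d1*d2 + L1*L2`, which is the kind of reduction that makes learning the map tractable.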
