Landscape-Aware Growing: The Power of a Little LAG
Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar
TL;DR
This work addresses how best to grow Transformer models during pretraining and challenges the conventional reliance on loss-preserving growth. It shows that the loss at initialization is a weak predictor of final performance, while the early loss landscape near the new initialization carries a strong predictive signal, including a measurable phase transition. The authors introduce Landscape-Aware Growing (LAG) and demonstrate it for both single-step growth and adaptive stacking, achieving near-optimal strategy selection with limited additional training. The findings offer a practical approach to more efficient stagewise pretraining and suggest avenues for adaptive, data-driven growth in large-scale models.
Abstract
Recently, there has been increasing interest in efficient pretraining paradigms for Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior work has focused extensively on loss- and/or function-preserving behavior at initialization, or simply on performance at the end of training. Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)". We extensively analyze how final performance correlates with performance in the initial steps of training and find that the optimal growing strategy can be predicted early and more accurately (i.e., with only a small "lag" after initialization). This perspective also motivates an adaptive strategy for gradual stacking.
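The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the strategy names, and the loss values below are hypothetical toy data, chosen only to show how ranking strategies at initialization can disagree with ranking them after a small lag. Each loss curve is assumed to be recorded at the same steps after growth, with the last entry standing in for final performance.

```python
from typing import Dict, List


def pick_strategy(loss_curves: Dict[str, List[float]], step: int) -> str:
    """Return the strategy with the lowest loss at `step` after growth."""
    return min(loss_curves, key=lambda name: loss_curves[name][step])


def _ranks(values: List[float]) -> List[int]:
    """Rank positions (0 = smallest). Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def rank_correlation_with_final(
    loss_curves: Dict[str, List[float]], step: int
) -> float:
    """Spearman rank correlation between loss at `step` and final loss."""
    names = list(loss_curves)
    early = _ranks([loss_curves[n][step] for n in names])
    final = _ranks([loss_curves[n][-1] for n in names])
    n = len(names)
    d_squared = sum((a - b) ** 2 for a, b in zip(early, final))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical toy curves (NOT results from the paper): the loss-preserving
# strategy starts lowest but is overtaken within a few steps.
toy_curves = {
    "loss_preserving": [2.50, 2.48, 2.46, 2.44, 2.42],
    "stacking": [3.10, 2.60, 2.40, 2.30, 2.25],
    "random_init": [6.00, 4.00, 3.20, 2.90, 2.70],
}

choice_at_init = pick_strategy(toy_curves, step=0)   # selection at initialization
choice_with_lag = pick_strategy(toy_curves, step=2)  # selection after a small lag
best_at_end = pick_strategy(toy_curves, step=-1)     # selection by final loss
```

In this toy setup, selection at step 0 favors the loss-preserving strategy, while selection after two steps already agrees with the final ranking. How large the lag must be in practice is an empirical question that the sketch does not answer.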
