Landscape-Aware Growing: The Power of a Little LAG

Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar

TL;DR

This work studies how best to grow Transformer models during pretraining and challenges the conventional reliance on loss-preserving growth. It shows that the loss at initialization is a weak predictor of final performance, while the early loss landscape near the new initialization carries a strong predictive signal, including a measurable phase transition. The authors introduce Landscape-Aware Growing (LAG) and demonstrate its effectiveness for both single-step growth and adaptive stacking, achieving near-optimal strategy selection with limited additional training. The findings offer a practical route to more efficient stagewise pretraining and suggest avenues for adaptive, data-driven growth in large-scale models.

Abstract

Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have extensively focused on loss- and/or function-preserving behavior at initialization or simply performance at the end of training. Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)". We perform extensive analysis of correlation of the final performance with performance in the initial steps of training and find early and more accurate predictions of the optimal growing strategy (i.e., with only a small "lag" after initialization). This perspective also motivates an adaptive strategy for gradual stacking.
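
The selection idea in the abstract can be illustrated with a minimal sketch: rather than ranking growing strategies by their loss immediately upon growing, rank them by validation loss after a small number of training steps. The function name `select_strategy`, the strategy labels, and the toy loss values below are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, Sequence


def select_strategy(val_losses: Dict[str, Sequence[float]], lag: int) -> str:
    """Return the strategy with the lowest validation loss at index `lag`.

    val_losses[name][t] is the validation loss of the grown model at the
    t-th evaluation after growing (t = 0 means no training yet).
    """
    return min(val_losses, key=lambda name: val_losses[name][lag])


# Hypothetical toy curves (illustrative numbers, not experimental results).
toy_curves = {
    "loss_preserving": [2.00, 1.98, 1.97, 1.96],  # good at init, slow to improve
    "stacking": [2.60, 2.10, 1.95, 1.90],         # worse at init, better later
}

choice_at_init = select_strategy(toy_curves, lag=0)    # ranks by loss at initialization
choice_after_lag = select_strategy(toy_curves, lag=2)  # ranks after a short "lag"
choice_final = select_strategy(toy_curves, lag=3)      # ranks by the last measured loss
```

In this toy setup the choice made at initialization disagrees with the final ranking, while the choice made after a short lag agrees with it. How long the lag needs to be in practice is an empirical question.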

Paper Structure (20 sections, 2 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Illustration of growing a network. Left: Generic growth of $k$ layers into $2k$ layers. Middle: Example with $L=6,k=3,i=2,b=1$, parameter duplication (interleaving). Right: Example with $L=6,k=3,i=2,b=3$, parameter duplication (single-block copying).
  • Figure 2: Growing a 12-layer BERT model at step 500,000 into a 16-layer BERT model and then training the larger model for 100K steps. Correlation (in validation loss) between (left) the loss at 600K steps and the loss immediately upon growing (i.e., without any training) and (right) the loss at 600K steps and the loss at 505K steps (i.e., after 5,000 steps of training the larger model).
  • Figure 3: Growing BERT from 12 to 16 layers: zooming in on steps 500,000 through 500,200. Spearman correlation heatmap (top left), Spearman correlation with final values (top right), Recall@$k$ (bottom left), Relative regret (bottom right). See below for details on how these plots were constructed. For all plots, at each step, the validation loss is first averaged over a window of 11 steps (centered at the step in question) to help smooth out noise.
  • Figure 4: Growing UL2 from 12 layers to 16 layers. Spearman correlation heatmap (left) and Spearman correlation with final values (right). Here, the validation loss is only measured every 100 steps, so these plots do not use smoothing (in contrast with Figure 3).
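
The quantities named in the captions of Figures 3 and 4 can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: `smooth` implements a centered moving average (the captions specify a window of 11 steps; truncating the window at the edges is an assumption), `spearman` is the standard rank correlation, and `relative_regret` uses an assumed definition (the relative gap in final loss between the strategy picked from early loss and the best strategy), since the exact definition is not given here. All function names and toy numbers are hypothetical.

```python
from typing import List, Sequence


def smooth(values: Sequence[float], window: int = 11) -> List[float]:
    """Centered moving average; the window is truncated at the sequence edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def relative_regret(early: Sequence[float], final: Sequence[float]) -> float:
    """Assumed definition: (final loss of early-loss pick - best final loss) / best final loss."""
    pick = min(range(len(early)), key=lambda i: early[i])
    best = min(final)
    return (final[pick] - best) / best


# Hypothetical losses for three growing strategies (illustrative only).
loss_at_init = [2.00, 2.60, 2.30]
loss_after_lag = [1.97, 1.95, 1.96]
loss_final = [1.96, 1.90, 1.93]

rho_init = spearman(loss_at_init, loss_final)
rho_lag = spearman(loss_after_lag, loss_final)
regret_init = relative_regret(loss_at_init, loss_final)
regret_lag = relative_regret(loss_after_lag, loss_final)
```

With these toy numbers, the ranking at initialization is the reverse of the final ranking, while the ranking after the lag matches it, which is the kind of pattern a correlation-versus-step plot is meant to expose.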