Deep Progressive Training: scaling up depth capacity of zero/one-layer models

Zhiqi Bu

TL;DR

This work proposes zero/one-layer progressive training for the optimal tradeoff between computation and loss, and offers insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion.

Abstract

Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but incur higher computational cost. To train models efficiently at scale, an effective strategy is progressive training, which scales up model capacity during training and thereby significantly reduces computation with little to no performance degradation. In this work, we study the depth expansion of large models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training for the optimal tradeoff between computation and loss. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ compute, or equivalently accelerate training by $\approx 5\times$, while achieving almost the same loss as a fully trained 60-layer model with 7B parameters.
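The relation between the quoted compute saving and speedup can be checked with a back-of-the-envelope calculation. The sketch below is not the paper's accounting: it assumes that per-iteration cost is proportional to parameter count, and it takes the zero-layer size (0.15B), the full size (7B), and the expansion point (80% of iterations) from the Figure 1 caption. The function name `relative_compute` is hypothetical.

```python
def relative_compute(small_params: float, full_params: float, expand_frac: float) -> float:
    """Compute of a progressive run relative to fixed-size training.

    Assumption (not from the paper): cost per iteration scales linearly
    with parameter count, so the small model costs small/full per step.
    """
    small_cost = small_params / full_params
    # Train the small model for expand_frac of the iterations,
    # then the full model for the remaining (1 - expand_frac).
    return expand_frac * small_cost + (1.0 - expand_frac)


# Values taken from the Figure 1 caption: zero-layer 0.15B -> 7B, expansion at 80%.
rel = relative_compute(0.15e9, 7e9, 0.8)
saving = 1.0 - rel      # fraction of compute saved
speedup = 1.0 / rel     # equivalent acceleration factor
print(f"relative compute: {rel:.3f}, saving: {saving:.1%}, speedup: {speedup:.2f}x")
```

Under these assumptions the estimate lands slightly below an 80% saving and a 5x speedup, which is in line with the approximate figures stated in the abstract. In general a saving of fraction $s$ corresponds to a speedup of $1/(1-s)$, so $s = 0.8$ gives exactly $5\times$.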

Paper Structure

This paper contains 32 sections, 11 equations, 20 figures, 2 tables.

Figures (20)

  • Figure 1: Zero-layer (red, 39M or 0.15B) and one-layer (blue and green, 46M or 0.27B) progressive training can achieve significant speedup over fixed-sized training (black, 12-layer 124M or 60-layer 7B) on GPT2 pre-trained on OpenWebText under WSD schedule. The difference in final validation loss is $<0.5\%$ for 124M runs and $<0.2\%$ for 7B runs. The depth expansion takes place at 80% of iterations for full runs, and at 2% of iterations immediately after warmup for early stopped runs.
  • Figure 2: Convergence of zero/one-layer progressive training and fixed-size training. Left: ResNet with depth expansion at the 32nd epoch. Right: GPT2 with depth expansion at 50k iterations.
  • Figure 3: Validation (solid) and training loss (dashed) at different learning rates of Muon-NSGD.
  • Figure 4: Convergence of multi-layer progressive training and fixed-size training. Left: ResNet with depth expansion at the 32nd and 64th epochs. Right: GPT2 with depth expansion at 30k and 70k iterations.
  • Figure 5: Performance of zero-layer progressive training and fixed-size training, where the WSD schedule significantly enhances progressive training. See one-layer results in the appendix (label `app:insights`).
  • ...and 15 more figures