Masked Structural Growth for 2x Faster Language Model Pre-training

Yiqun Yao; Zheng Zhang; Jing Li; Yequan Wang

TL;DR

The paper tackles the heavy cost of pre-training large language models by introducing Masked Structural Growth (MSG), a framework that progressively expands Transformer architectures across four growth dimensions while maintaining strict function preservation through masking. MSG provides growth operators for all four dimensions together with grid-search-based growth schedules, achieving speed-ups of up to 2.2x on BERT-large and 1.4x on GPT-2 with downstream performance comparable or superior to training from scratch. A key contribution is decoupling function preservation from the initialization of new weights, which enables flexible, initialization-agnostic growth and improved training dynamics. The work demonstrates MSG's practicality and lays a foundation for future research on adaptive growth schedules and scaling to very large models.

Abstract

Accelerating large language model pre-training is a critical issue in current research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each single dimension on a schedule's efficiency is under-explored by existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements in training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performance. Code is publicly available at https://github.com/cofe-ai/MSG.
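
The masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal pure-Python example, under stated assumptions, of widening the hidden layer of a two-layer MLP. The helper names (`mlp`, `grow_hidden`, `ramp_mask`), the Gaussian initialization, and the idea of later raising the mask toward 1 are illustrative assumptions. The point it shows is that when new hidden units are multiplied by a mask of 0, the output is unchanged regardless of how the new weights are initialized.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W1, b1, W2, b2, x, mask):
    # y = W2 @ (mask * relu(W1 @ x + b1)) + b2
    pre = [a + b for a, b in zip(matvec(W1, x), b1)]
    h = [m * a for m, a in zip(mask, relu(pre))]
    return [a + b for a, b in zip(matvec(W2, h), b2)]

def grow_hidden(W1, b1, W2, mask, new_width, rng):
    # Hypothetical helper: widen the hidden layer to new_width units.
    # New weights get an arbitrary random init (assumption: any init works),
    # and the mask for the new units starts at 0 so they contribute nothing.
    extra = new_width - len(W1)
    d_in = len(W1[0])
    W1_new = [row[:] for row in W1]
    W1_new += [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(extra)]
    b1_new = b1[:] + [rng.gauss(0, 1) for _ in range(extra)]
    W2_new = [row[:] + [rng.gauss(0, 1) for _ in range(extra)] for row in W2]
    mask_new = mask[:] + [0.0] * extra
    return W1_new, b1_new, W2_new, mask_new

def ramp_mask(mask, old_width, value):
    # Hypothetical helper: set the mask of the new units to `value` in [0, 1],
    # e.g. raised step by step during training (an assumed schedule).
    value = min(1.0, max(0.0, value))
    return mask[:old_width] + [value] * (len(mask) - old_width)

rng = random.Random(0)
d_in, hidden, d_out = 4, 3, 2
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(hidden)]
b1 = [rng.gauss(0, 1) for _ in range(hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(d_out)]
b2 = [rng.gauss(0, 1) for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]
mask = [1.0] * hidden

y_before = mlp(W1, b1, W2, b2, x, mask)
W1g, b1g, W2g, mask_g = grow_hidden(W1, b1, W2, mask, new_width=5, rng=rng)
y_after = mlp(W1g, b1g, W2g, b2, x, mask_g)
```

In this sketch `y_after` matches `y_before` because every new unit is multiplied by zero before it reaches the second layer, so function preservation does not depend on the values drawn for the new weights. Calling `ramp_mask(mask_g, 3, value)` with increasing `value` would let the new units start contributing gradually.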

Paper Structure (37 sections, 27 equations, 5 figures, 10 tables)

Introduction
Preliminaries
Task Formulation
Growth Dimensions of Transformers
Function-preservation
The Layer Normalization Dilemma
The Depth Dimension
Dependency on Initialization
Masked Structural Growth
The MSG Operators
Fully-Connected Layers
LN Solution
Growth of Self-attention Heads
Depth Growth
Growth Schedule
...and 22 more sections

Figures (5)

Figure 1: MSG (right) vs. Net2Net (middle) in the expansion of fully-connected layers.
Figure 2: Training loss curves with different structural hyperparameters. We study the impact of each growth dimension on the model's "pre-training rate" $\gamma$ in early stages.
Figure 3: Training loss curves of MSG with and without mask on the Rapid-L schedule.
Figure 4: Training curve of the 101B model.
Figure 5: MSG loss curves. The pink vertical lines mark the beginning of each growth stage.
