Efficient Construction of Model Family through Progressive Training Using Model Expansion

Kazuki Yano, Sho Takase, Sosuke Kobayashi, Shun Kiyono, Jun Suzuki

TL;DR

The paper tackles the substantial compute burden of building a model family by introducing progressive training via model expansion, in which each larger model is initialized from the previous smaller one. With the total training cost constrained to match that of the largest model trained from scratch, the method uses roughly 25% fewer FLOPs than training every model independently while delivering comparable performance, and improved performance when the maximum learning rate is adjusted by model size (higher for smaller models, lower for larger ones). Progressive training also yields greater behavioral consistency across model sizes, as evidenced by lower KL divergences between adjacent models. These findings offer a practical, compute-efficient approach to producing coherent, deployable model families suitable for diverse compute budgets and applications.
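
The consistency claim rests on KL divergence between adjacent models in the family. The following is a minimal sketch of how such a quantity could be computed, assuming access to each model's next-token logits on the same inputs. The function names, the direction of the KL term, and the averaging over positions are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kl(logits_p: np.ndarray, logits_q: np.ndarray) -> float:
    """Mean over positions of KL(P || Q) between next-token distributions.

    Both arrays have shape (positions, vocab) and must come from the
    same input tokens (assumption: the two models share a vocabulary).
    """
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_position.mean())


# Hypothetical stand-ins for the logits of two adjacent models.
rng = np.random.default_rng(0)
small_model_logits = rng.normal(size=(16, 100))
large_model_logits = small_model_logits + 0.1 * rng.normal(size=(16, 100))
print(mean_kl(small_model_logits, large_model_logits))
```

A lower value means the two models assign more similar probabilities to the next token, which is the sense in which a family would be called behaviorally consistent.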

Abstract

As Large Language Models (LLMs) gain widespread practical application, providing a model family of different parameter sizes has become standard practice to address diverse computational requirements. Conventionally, each model in a family is trained independently, resulting in computational costs that scale additively with the number of models. We propose an efficient method for constructing a model family through progressive training, where smaller models are incrementally expanded to larger sizes to create a complete model family. Through extensive experiments with a model family ranging from 1B to 8B parameters, we demonstrate that our method reduces computational costs by approximately 25% while maintaining comparable performance to independently trained models. Furthermore, by strategically adjusting maximum learning rates based on model size, our method outperforms independent training across various metrics. Beyond performance gains, our approach offers an additional advantage: models in our family tend to yield more consistent behavior across different model sizes.
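
The cost comparison in the abstract can be illustrated with a rough back-of-the-envelope calculation. The sketch below rests on several assumptions that are not stated in the abstract: a family of 1B, 2B, 4B, and 8B models, the common approximation that training FLOPs are about 6 × parameters × tokens, and a token budget proportional to model size (20 tokens per parameter, a Chinchilla-style ratio). The function names are hypothetical.

```python
def train_flops(params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate training FLOPs as 6 * N * D, with D = tokens_per_param * N."""
    tokens = tokens_per_param * params
    return 6.0 * params * tokens


def family_savings(sizes: list[float], tokens_per_param: float = 20.0) -> float:
    """Fraction of FLOPs saved when the whole family costs as much as
    training only the largest model, versus training every model from scratch."""
    independent = sum(train_flops(n, tokens_per_param) for n in sizes)
    progressive = train_flops(max(sizes), tokens_per_param)
    return 1.0 - progressive / independent


# Assumed family sizes (parameters); illustrative only.
sizes = [1e9, 2e9, 4e9, 8e9]
print(f"{family_savings(sizes):.1%}")  # prints 24.7%
```

Under these assumptions the saving is independent of the tokens-per-parameter ratio, because that ratio scales both totals equally, and the result is in line with the approximately 25% reduction reported above.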

Paper Structure

This paper contains 29 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison of approaches for constructing a model family. (Top): The conventional approach where each model in the family (2B, 4B, 8B) is trained independently from scratch. The total computational cost equals the sum of individual costs for all models. (Bottom): The proposed progressive training utilizes model expansion, where smaller models are expanded to initialize larger ones. The total cost equals only that of the largest model (8B).
  • Figure 2: Training loss curves comparing models trained with the Independent and Progressive approaches under the 2x Chinchilla law setting. (Left): Models trained with a fixed maximum learning rate, achieving a 26% FLOPs reduction. (Right): Models trained with maximum learning rate adjustment, from $1.5 \times 10^{-3}$ (1B) to $3.0 \times 10^{-4}$ (8B), achieving a 31% FLOPs reduction.
  • Figure 3: Comparison of token allocation for the 8B model training. (Top): Independent uses a standard approach with 320B tokens. (Middle): Progressive uses 240B tokens (100B + new 140B tokens). (Bottom): Progressive+Fixed Data maintains the same 240B total tokens as Progressive by reusing 140B tokens from previous stages.
  • Figure 4: Training loss curves comparing 8B models trained from scratch with different learning rates. The model trained with a learning rate of $3.0 \times 10^{-4}$ (blue line) shows stable training, while the model trained with a higher learning rate of $1.5 \times 10^{-3}$ (orange line) exhibits severe loss spikes indicating training instability.