A Multi-Level Framework for Accelerating Training Transformer Models

Longwei Zou, Han Zhang, Yangdong Deng

TL;DR

Training large transformer models is highly resource-intensive. The authors introduce a multi-level V-cycle framework built on three operators—Coalescing, De-coalescing, and Interpolation—that transfers fast-converging solutions from smaller models to larger ones. They provide formal operator definitions and demonstrate across BERT, GPT, and DeiT that 20–51.6% of training cost can be saved without sacrificing performance. This approach offers a practical path to faster, more energy-efficient pre-training of large-scale transformers, potentially broadening access to state-of-the-art models.

Abstract

The fast-growing capabilities of large-scale deep learning models, such as BERT, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing power, which incurs exponentially increasing energy cost and carbon dioxide emissions. It is thus critical to develop efficient training solutions to reduce the training costs. Motivated by a set of key observations of inter- and intra-layer similarities among feature maps and attentions that can be identified from typical training processes, we propose a multi-level framework for training acceleration. Specifically, the framework is based on three basic operators, Coalescing, De-coalescing and Interpolation, which can be orchestrated to build a multi-level training framework. The framework consists of a V-cycle training process, which progressively down- and up-scales the model size and projects the parameters between adjacent levels of models via coalescing and de-coalescing. The key idea is that a smaller model can be trained to converge quickly, and its trained parameters provide high-quality intermediate solutions for the next-level larger network. The interpolation operator is designed to break the symmetry of neurons incurred by de-coalescing for better convergence performance. Our experiments on transformer-based language models (e.g., BERT, GPT) as well as a vision model (DeiT) show that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.
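
The V-cycle schedule described in the abstract can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `v_cycle`, its arguments, and the toy operators (pair averaging, entry duplication, a `train` stub that only logs its calls) are hypothetical names chosen for the example, and training briefly at every level on the way down is an assumed generalization of the 2-level process.

```python
from typing import Callable, Dict, List, Tuple

Params = List[float]  # toy stand-in for a model's parameters


def v_cycle(
    params: Params,
    num_levels: int,
    warmup_epochs: int,
    level_epochs: Dict[int, int],
    alpha: float,
    train: Callable[[Params, int, int], Params],
    coalesce: Callable[[Params], Params],
    decoalesce: Callable[[Params], Params],
) -> Params:
    """Run one V-cycle; level 1 is the full-size model."""
    saved: List[Params] = []
    # Downward leg: train briefly, remember the parameters, then shrink.
    for level in range(1, num_levels):
        params = train(params, level, warmup_epochs)
        saved.append(params)
        params = coalesce(params)
    # Upward leg: train the smaller model, project it up, then interpolate
    # with the parameters saved before coalescing (alpha=0 keeps the saved
    # parameters, alpha=1 keeps the de-coalesced ones).
    for level in range(num_levels, 1, -1):
        params = train(params, level, level_epochs[level])
        projected = decoalesce(params)
        before = saved.pop()
        params = [(1 - alpha) * b + alpha * p for b, p in zip(before, projected)]
    # Continue training the interpolated full-size model.
    return train(params, 1, level_epochs[1])


# Toy operators (assumptions for illustration only).
def toy_coalesce(p: Params) -> Params:
    return [(p[i] + p[i + 1]) / 2 for i in range(0, len(p), 2)]


def toy_decoalesce(p: Params) -> Params:
    return [v for v in p for _ in range(2)]


log: List[Tuple[int, int]] = []


def toy_train(p: Params, level: int, epochs: int) -> Params:
    log.append((level, epochs))  # record the schedule; no real optimization
    return p


result = v_cycle([1.0, 3.0, 5.0, 7.0], 2, 1, {1: 10, 2: 5}, 0.5,
                 toy_train, toy_coalesce, toy_decoalesce)
print(log)     # order of (level, epochs) training calls
print(result)  # interpolated full-size parameters
```

Passing the operators in as callables keeps the schedule separate from how a particular architecture defines coalescing and de-coalescing.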

Paper Structure

This paper contains 43 sections, 19 equations, 8 figures, 6 tables, 4 algorithms.

Figures (8)

  • Figure 1: Visualization of attention patterns in BERT-Base for a randomly chosen sample sentence. The darker the color, the more attention a token pays to another one. The first row shows the attention patterns of various heads in layer 4 and illustrates the similarity within a layer. The second row shows the attention patterns of the layers adjacent to layer 4, i.e., layer 3 and layer 5, and demonstrates the similarity between layers. These inter- and intra-layer similarities offer the potential to accelerate training with a multi-level framework.
  • Figure 2: A 2-level V-cycle training process. $M_1$ with parameters of $[{\bm{W}}^1_1, ..., {\bm{W}}^{1}_{L_1}]$ is the original model to train. $M_2$ with parameters of $[{\bm{W}}^2_1, ..., {\bm{W}}^{2}_{L_2}]$ is a smaller model which is coalesced from $M_1$ by coalescing intra- and inter-layer neighboring nodes. ${\bm{F}}^{l}_{in}$ and ${\bm{F}}^{l}_{out}$ are used to decrease the input and output dimension of parameter ${\bm{W}}^{1}_{l}$ in layer $l$. The depth coalescing matrix is decomposed into an $L_{1} \times L_{2}$ matrix and an identity matrix via Kronecker product. We first train the $M_1$ model for $E_a$ epochs to initialize model parameters. Then we train the coalesced $M_2$ model, which converges faster. After that, we de-coalesce the parameters of $M_2$ to the original size and interpolate them with the parameters of $M_1$ before coalescing. Finally, we continue to train the interpolated $M_1$ model.
  • Figure 3: Results on BERT-Base, GPT-Base and BERT-Large. (a-c) show loss curves of BERT-Base, GPT-Base and BERT-Large pre-training. The dashed lines are the final results of models trained from scratch. For BERT-Base and GPT-Base, our approach saves about 20% of the computational cost. For BERT-Large, we save 37.4% of the training cost with the 2-level training process and 51.6% with the 3-level process.
  • Figure 4: Pre-training GPT-Large mapped once and twice with LiGO. Results show that GPT-Large mapped twice converges significantly slower than GPT-Large mapped once. This confirms that monotonically increasing the model size, as proposed in previous literature, is not beneficial.
  • Figure 5: Effect of the Coalescing Operation. In panel (b), an interpolation ratio of zero corresponds to the GPT-Base model before coalescing. Conversely, a ratio of one yields the de-coalesced model, with or without the coalescing operation applied.
  • ...and 3 more figures
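
The operators named in the Figure 2 caption can be illustrated with a small NumPy sketch. It makes several assumptions that are not taken from the paper: the coalescing matrices average adjacent pairs of neurons or layers, the de-coalescing matrices duplicate them, and all shapes and variable names (`pair_average`, `pair_duplicate`, `W_narrow`, `alpha`, and so on) are hypothetical. The sketch shows width coalescing with $F_{in}$ and $F_{out}$, depth coalescing written both directly and as a Kronecker product of an $L_1 \times L_2$ matrix with an identity matrix, and interpolation breaking the symmetry that de-coalescing introduces.

```python
import numpy as np


def pair_average(n_small: int) -> np.ndarray:
    """(n_small, 2*n_small) matrix that averages adjacent pairs of entries."""
    return np.kron(np.eye(n_small), np.array([[0.5, 0.5]]))


def pair_duplicate(n_small: int) -> np.ndarray:
    """(2*n_small, n_small) matrix that copies each entry twice."""
    return np.kron(np.eye(n_small), np.array([[1.0], [1.0]]))


rng = np.random.default_rng(0)
d_in, d_out, L1, L2 = 4, 6, 4, 2
W1 = rng.normal(size=(L1, d_in, d_out))  # one weight matrix per layer of M_1

# Width coalescing: shrink the input and output dimension of every layer.
F_in = pair_average(d_in // 2)        # (d_in/2, d_in)
F_out = pair_average(d_out // 2).T    # (d_out, d_out/2)
W_narrow = F_in @ W1 @ F_out          # (L1, d_in/2, d_out/2)

# Depth coalescing: merge adjacent layers with an L1 x L2 matrix R.
R = pair_average(L2).T                # (L1, L2)
W2 = np.einsum("ij,iab->jab", R, W_narrow)

# Same depth step in Kronecker form: [W_1 ... W_L1] @ (R kron I).
flat = np.concatenate(list(W_narrow), axis=1)
W2_kron = flat @ np.kron(R, np.eye(d_out // 2))

# De-coalescing: duplicate layers, then duplicate neurons.
W_deep = np.repeat(W2, 2, axis=0)     # (L1, d_in/2, d_out/2)
T_in = pair_duplicate(d_in // 2)      # (d_in, d_in/2)
T_out = pair_duplicate(d_out // 2).T  # (d_out/2, d_out)
W_up = T_in @ W_deep @ T_out          # same shape as W1, but symmetric

# Interpolation: alpha=0 gives W1, alpha=1 gives the de-coalesced weights.
alpha = 0.25
W_new = (1 - alpha) * W1 + alpha * W_up

print(W2.shape, W_up.shape)
print(np.allclose(W_up[:, :, 0], W_up[:, :, 1]))    # duplicated output neurons
print(np.allclose(W_new[:, :, 0], W_new[:, :, 1]))  # after interpolation
```

With these particular matrices, coalescing the de-coalesced weights recovers the small model exactly, which is one simple way to keep the two projections consistent with each other.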