A multilevel approach to accelerate the training of Transformers

Guillaume Lauga, Maël Chaumette, Edgar Desainte-Maréville, Étienne Lasalle, Arthur Lebeurrier

TL;DR

The paper addresses the high computational cost of training deep Transformer models by adopting an ODE interpretation of Transformer decoders and introducing a two-level multilevel training strategy. By halving depth to form coarse models and prolongating their learned parameters back to a fine model, the approach accelerates optimization without resorting to width/depth operators. Empirical results on a sequence-generation task show a 44% reduction in FLOPs while achieving the same training loss as standard single-level training, though challenges remain in momentum interaction and scaling to larger architectures. The work highlights the potential of discretization-based multilevel schemes for efficient transformer training and outlines avenues for further theoretical and empirical development.

Abstract

In this article, we investigate the potential of multilevel approaches to accelerate the training of transformer architectures. Using an ordinary differential equation (ODE) interpretation of these architectures, we propose an appropriate way of varying the discretization of these ODE Transformers in order to accelerate the training. We validate our approach experimentally by a comparison with the standard training procedure.
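
As a rough illustration of the ODE interpretation mentioned in the abstract, a stack of residual layers can be read as a forward Euler discretization of a continuous-time system. This is a generic sketch and not necessarily the paper's exact formulation; the symbols $x_k$, $f$, $\theta_k$, $h$, $T$ and $N$ are notation assumed here for illustration.

```latex
\begin{align*}
\dot{x}(t) &= f\bigl(x(t), \theta(t)\bigr), \qquad t \in [0, T], \\
x_{k+1} &= x_k + h\, f(x_k, \theta_k), \qquad k = 0, \dots, N-1, \qquad h = T/N .
\end{align*}
```

Under this reading, going from $N$ layers to $N/2$ layers on the same interval $[0, T]$ amounts to doubling the step size $h$, which is one way to picture a coarser discretization of the same model.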

Paper Structure

This paper contains 18 sections, 5 equations, 3 figures.

Figures (3)

  • Figure 1: Training loss of the single-level algorithm (standard method) in blue and of the multilevel algorithm (our proposed approach) in red with respect to the number of optimization steps (at fine level). The curves are averaged over $6$ seeds. In lighter colors we display the standard deviation of the $6$ training runs.
  • Figure 2: Training loss of the single-level algorithm (standard method) in blue and of the multilevel algorithm (our proposed approach) in red with respect to FLOPs. The curves are averaged over $6$ seeds. In lighter colors we display the standard deviation of the $6$ training runs.
  • Figure 3: Scheme of our proposed approach on a network with $4$ transformer layers. The fine-level network (blue blocks) is decomposed into two coarser networks (red blocks) that contain the even-indexed (resp. odd-indexed) layers. Input and output layers (gray blocks) are shared across all networks.
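
The decomposition described in the Figure 3 caption can be sketched as a simple index split. The following is a minimal illustration, assuming zero-based layer indices and assuming that the fine-level ordering is recovered by interleaving the two coarse stacks; the helper names `split_even_odd` and `interleave` are hypothetical and do not come from the paper. Layers are represented by placeholder strings, and the shared input and output layers are left out.

```python
from typing import Any, List, Sequence, Tuple


def split_even_odd(layers: Sequence[Any]) -> Tuple[List[Any], List[Any]]:
    """Split fine-level layers into even-indexed and odd-indexed stacks.

    Assumption: indices are zero-based, so the "even" stack starts with
    the first layer.
    """
    return list(layers[0::2]), list(layers[1::2])


def interleave(even: Sequence[Any], odd: Sequence[Any]) -> List[Any]:
    """Rebuild the fine-level ordering by interleaving two coarse stacks.

    Assumption: this interleaving is only one plausible way to map coarse
    layers back to the fine level; the paper's operator may differ.
    """
    if not 0 <= len(even) - len(odd) <= 1:
        raise ValueError("even stack must have the same length as odd, or one more")
    fine: List[Any] = [None] * (len(even) + len(odd))
    fine[0::2] = list(even)  # even-indexed slots
    fine[1::2] = list(odd)   # odd-indexed slots
    return fine


# Four transformer layers, as in the figure (placeholder names).
fine_layers = ["layer0", "layer1", "layer2", "layer3"]
coarse_even, coarse_odd = split_even_odd(fine_layers)
rebuilt = interleave(coarse_even, coarse_odd)
```

With real models, the list elements would be the transformer blocks themselves, and each coarse network would wrap its stack between the shared input and output layers.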