A multilevel approach to accelerate the training of Transformers
Guillaume Lauga, Maël Chaumette, Edgar Desainte-Maréville, Étienne Lasalle, Arthur Lebeurrier
TL;DR
The paper addresses the high computational cost of training deep Transformer models by adopting an ODE interpretation of Transformer decoders and introducing a two-level training strategy. Coarse models are formed by halving the depth, and their learned parameters are transferred back to the fine model by prolongation, which accelerates optimization without resorting to width/depth operators. On a sequence-generation task, the approach reaches the same training loss as standard single-level training with 44% fewer FLOPs. Open challenges remain in how the scheme interacts with momentum and in scaling to larger architectures. The work highlights the potential of discretization-based multilevel schemes for efficient Transformer training and outlines avenues for further theoretical and empirical development.
Abstract
In this article, we investigate the potential of multilevel approaches to accelerate the training of Transformer architectures. Using an ordinary differential equation (ODE) interpretation of these architectures, we propose an appropriate way of varying the discretization of these ODE Transformers in order to accelerate training. We validate our approach experimentally by comparing it with the standard training procedure.
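To make the idea of varying the discretization concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a residual stack is read as a forward Euler scheme with step size h = T / depth, that the coarse model keeps every other layer, and that prolongation copies each coarse layer into two consecutive fine layers. The scalar `layer` function is a toy stand-in for a Transformer block, and all names (`layer`, `forward`, `restrict`, `prolong`) are hypothetical.

```python
import math


def layer(x, theta):
    # Toy stand-in for the residual branch of a Transformer block (assumption).
    w, b = theta
    return math.tanh(w * x + b)


def forward(x, thetas, T=1.0):
    # Forward Euler reading of a residual stack:
    # x_{k+1} = x_k + h * f(x_k, theta_k), with h = T / depth.
    h = T / len(thetas)
    for theta in thetas:
        x = x + h * layer(x, theta)
    return x


def restrict(thetas):
    # Coarse model: halve the depth by keeping every other layer (assumption).
    return thetas[::2]


def prolong(coarse_thetas):
    # Fine model: copy each coarse layer into two consecutive fine layers,
    # i.e. piecewise-constant interpolation in the ODE time variable (assumption).
    fine = []
    for theta in coarse_thetas:
        fine.extend([theta, theta])
    return fine


# An 8-layer fine model and its 4-layer coarse counterpart.
fine_thetas = [(0.1 * k, 0.0) for k in range(8)]
coarse_thetas = restrict(fine_thetas)

# In a two-level scheme the coarse parameters would be trained here,
# at roughly half the per-step cost, before being prolongated.
warm_start = prolong(coarse_thetas)

y_coarse = forward(0.5, coarse_thetas)
y_fine = forward(0.5, warm_start)
```

Because the step size is tied to the depth, the coarse model and the prolongated fine model discretize the same underlying ODE, so their outputs stay close and the fine model starts from a meaningful initialization instead of a random one.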
