LazyFormer: Self Attention with Lazy Update
Chengxuan Ying, Guolin Ke, Di He, Tie-Yan Liu
TL;DR
The paper tackles the efficiency bottleneck of self-attention in Transformer pre-training by introducing LazyFormer, which partitions the model's depth into lazy blocks and computes the attention distribution only in the first layer of each block, reusing it in the subsequent layers of that block. For $k$ layers, sequence length $n$, and $m$ layers per lazy block, this reduces the cost of computing attention distributions from $O(kn^2)$ to $O(kn^2/m)$, which speeds up pre-training and makes it possible to train wider models within the same budget. The authors also explore wider layers and dropout removal to improve efficiency, and report experiments in which LazyFormer achieves a speedup of about 1.3x while maintaining or improving GLUE performance, with larger gains for longer sequences. In practice, the approach makes pre-training more scalable and uses computational resources more effectively.
Abstract
Improving the efficiency of Transformer-based language pre-training is an important task in NLP, especially for the self-attention module, which is computationally expensive. In this paper, we propose a simple but effective solution, called LazyFormer, which computes the self-attention distribution infrequently. LazyFormer is composed of multiple lazy blocks, each of which contains multiple Transformer layers. In each lazy block, the self-attention distribution is computed only once, in the first layer, and is then reused in all upper layers. In this way, much of the computation cost is saved. We also provide several training tricks for LazyFormer. Extensive experiments demonstrate the effectiveness of the proposed method.
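
The lazy-block idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes single-head attention, omits the feed-forward sublayers and layer normalization, and assumes each layer keeps its own value projection while reusing the block's attention distribution. The names `lazy_forward`, `layers`, and `block_size` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lazy_forward(x, layers, block_size):
    """Apply simplified attention layers to x of shape (n, d).

    The (n, n) attention distribution is recomputed only at the first
    layer of each lazy block and reused by the remaining layers of that
    block. Returns the output and the number of distributions computed.
    """
    d = x.shape[-1]
    attn = None
    n_attn = 0
    for i, (w_q, w_k, w_v) in enumerate(layers):
        if i % block_size == 0:
            # First layer of a lazy block: compute queries, keys,
            # and the attention distribution.
            queries, keys = x @ w_q, x @ w_k
            attn = softmax(queries @ keys.T / np.sqrt(d))
            n_attn += 1
        # Every layer (assumption) applies its own value projection,
        # followed by a residual connection.
        x = x + attn @ (x @ w_v)
    return x, n_attn

rng = np.random.default_rng(0)
n, d, num_layers, block_size = 8, 16, 6, 2
layers = [
    tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    for _ in range(num_layers)
]
x = rng.normal(size=(n, d))

out, n_attn = lazy_forward(x, layers, block_size)
_, n_attn_baseline = lazy_forward(x, layers, 1)
print(n_attn, n_attn_baseline)  # 3 6
```

With six layers and two layers per lazy block, the sketch computes three attention distributions instead of six. Setting `block_size` to 1 recovers the usual behavior of one distribution per layer.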
