LazyFormer: Self Attention with Lazy Update

Chengxuan Ying, Guolin Ke, Di He, Tie-Yan Liu

TL;DR

The paper tackles the efficiency bottleneck of self-attention in Transformer pre-training by introducing LazyFormer, which partitions the model's depth into lazy blocks, computes the attention distribution only in the first layer of each block, and reuses it in the block's remaining layers. This reduces the attention cost from $O(kn^2)$ to $O(kn^2/m)$, which speeds up pre-training and makes it possible to train wider models within the same budget. The authors also explore wider layers and dropout removal to boost efficiency, and report experiments in which LazyFormer achieves about 1.3x speedups with maintained or improved GLUE performance, with larger gains for longer sequences. In practice, the approach makes pre-training more scalable and uses computational resources more efficiently.

Abstract

Improving the efficiency of Transformer-based language pre-training is an important task in NLP, especially for the self-attention module, which is computationally expensive. In this paper, we propose a simple but effective solution, called LazyFormer, which computes the self-attention distribution infrequently. LazyFormer is composed of multiple lazy blocks, each of which contains multiple Transformer layers. In each lazy block, the self-attention distribution is computed only once, in the first layer, and is then reused in all upper layers. In this way, the computation cost is greatly reduced. We also provide several training tricks for LazyFormer. Extensive experiments demonstrate the effectiveness of the proposed method.
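
The reuse pattern described in the abstract can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation: the names `lazy_forward`, `attention_distribution`, and `block_size` are hypothetical, and the sketch assumes each layer keeps its own value projection while sharing the attention distribution within a block. Multi-head attention, layer normalization, and feed-forward sublayers are left out for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in scores:
        mx = max(row)
        exps = [math.exp(s - mx) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_distribution(x, w_q, w_k):
    """Compute softmax(Q K^T / sqrt(d)), the n-by-n attention distribution."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    d = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return softmax_rows(scores)


def lazy_forward(x, layers, block_size):
    """Run the layers, recomputing attention only at the start of each block.

    Returns the final hidden states and the number of times the attention
    distribution was actually computed.
    """
    attn = None
    n_attn = 0
    for i, layer in enumerate(layers):
        if i % block_size == 0:
            # First layer of a lazy block: compute the distribution.
            attn = attention_distribution(x, layer["w_q"], layer["w_k"])
            n_attn += 1
        # Every layer (assumption): own value projection, shared distribution.
        v = matmul(x, layer["w_v"])
        update = matmul(attn, v)
        # Residual connection.
        x = [[xi + ui for xi, ui in zip(xr, ur)] for xr, ur in zip(x, update)]
    return x, n_attn


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d, num_layers = 4, 8, 6  # toy sizes chosen only for illustration
layers = [
    {name: random_matrix(d, d, rng) for name in ("w_q", "w_k", "w_v")}
    for _ in range(num_layers)
]
x = random_matrix(n, d, rng)

out, n_attn = lazy_forward(x, layers, block_size=2)
print(n_attn)  # 3 attention computations for 6 layers
```

With `block_size=1` the loop reduces to an ordinary stack in which every layer computes its own distribution, so the saving comes entirely from the skipped query-key products and softmax calls in the non-first layers of each block.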

Paper Structure

This paper contains 17 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The basic block in LazyFormer.
  • Figure 2: Both M2x6-S and M2x6 converge much faster than the baselines. In addition, M2x6 achieves better performance on downstream tasks at a much lower pre-training cost.
  • Figure 3: Speedup ratio of LazyFormer under different settings of lazy blocks.