SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization

Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang

TL;DR

This work tackles transformer efficiency by addressing two bottlenecks: normalization and attention. It introduces Progressive Re-parameterized BatchNorm (PRepBN) to gradually replace LayerNorm with a re-parameterized BatchNorm during training, enabling pure BN-based inference after training, and a Simplified Linear Attention (SLA) to replace the costly softmax-based attention with a lightweight, ReLU-based mechanism plus depthwise convolution. The SLAB framework demonstrates strong image classification results with lower latency (e.g., SLAB-Swin-T achieving competitive accuracy at reduced compute) and maintains or improves performance in object detection and language modeling, including improved throughput on LLaMA-350M. Ablation studies confirm that both SLA and PRepBN contribute to the gains, with combined use offering substantial latency reductions and robust accuracy across multiple backbones. The approach provides a practical path toward deployable, efficient transformers on resource-constrained devices while preserving modeling capacity.
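The ReLU-based attention with a depthwise convolution described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the output is taken to be ReLU(Q) (ReLU(K)^T V) plus a depthwise convolution of V, the convolution is reduced to 1-D over the token axis for brevity, and all function names and the `kernel` layout are hypothetical.

```python
def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def depthwise_conv1d(v, kernel):
    """Per-channel 1-D convolution over tokens with zero padding.
    kernel[c] holds the odd-length taps for channel c (assumed layout)."""
    n, d = len(v), len(v[0])
    half = len(kernel[0]) // 2
    out = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for c in range(d):
            for t, w in enumerate(kernel[c]):
                j = i + t - half
                if 0 <= j < n:
                    out[i][c] += w * v[j][c]
    return out

def simplified_linear_attention(q, k, v, kernel):
    """Assumed form: ReLU(Q) @ (ReLU(K)^T @ V) + DWC(V)."""
    q, k = relu(q), relu(k)
    kv = matmul(transpose(k), v)   # (d x d) summary, built once from all tokens
    attn = matmul(q, kv)           # (N x d); no N x N attention map is formed
    local = depthwise_conv1d(v, kernel)
    return [[a + l for a, l in zip(ra, rl)] for ra, rl in zip(attn, local)]
```

Because matrix multiplication is associative, computing `ReLU(K)^T V` first yields the same result as forming the full N x N similarity matrix, but the cost grows linearly rather than quadratically with the number of tokens N.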

Abstract

Transformers have become foundational architectures for both natural language and computer vision tasks. However, their high computational cost makes them challenging to deploy on resource-constrained devices. This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules. LayerNorm is commonly used in transformer architectures but is not computationally friendly because it must calculate statistics during inference. However, replacing LayerNorm with the more efficient BatchNorm in transformers often leads to inferior performance and training collapse. To address this problem, we propose a novel method named PRepBN that progressively replaces LayerNorm with re-parameterized BatchNorm during training. Moreover, we propose a simplified linear attention (SLA) module that is simple yet effective and achieves strong performance. Extensive experiments on image classification as well as object detection demonstrate the effectiveness of our proposed method. For example, our SLAB-Swin obtains $83.6\%$ top-1 accuracy on ImageNet-1K with $16.2$ ms latency, which is $2.4$ ms less than that of Flatten-Swin with $0.1\%$ higher accuracy. We also evaluate our method on the language modeling task and obtain comparable performance and lower latency. Code is publicly available at https://github.com/xinghaochen/SLAB and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SLAB.
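To illustrate why BatchNorm is cheaper than LayerNorm at inference: BatchNorm uses stored statistics, so it reduces to a fixed per-channel affine map that can be folded into an adjacent linear layer. The sketch below shows this standard folding for a BatchNorm followed by a linear layer. The ordering, the function names, and the `eps` default are assumptions made for illustration, not details taken from the paper.

```python
import math

def batch_norm_inference(x, mean, var, weight, bias, eps=1e-5):
    """BN at inference: a fixed per-channel affine map using stored statistics."""
    return [w * (v - m) / math.sqrt(s + eps) + b
            for v, m, s, w, b in zip(x, mean, var, weight, bias)]

def linear(x, W, b):
    """y = W x + b, with W given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + bo for row, bo in zip(W, b)]

def fold_bn_into_linear(W, b, mean, var, weight, bias, eps=1e-5):
    """Return (W', b') such that linear(x, W', b') == linear(BN(x), W, b)."""
    scale = [w / math.sqrt(s + eps) for w, s in zip(weight, var)]
    shift = [bi - m * sc for bi, m, sc in zip(bias, mean, scale)]
    W_folded = [[w * sc for w, sc in zip(row, scale)] for row in W]
    b_folded = [bo + sum(w * sh for w, sh in zip(row, shift))
                for row, bo in zip(W, b)]
    return W_folded, b_folded
```

After folding, the normalization layer costs nothing at inference. LayerNorm cannot be folded this way because its mean and variance depend on each input token.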

Paper Structure (16 sections, 1 theorem, 9 equations, 5 figures, 9 tables)

This paper contains 16 sections, 1 theorem, 9 equations, 5 figures, 9 tables.

Key Result

Lemma 4.1

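Denote a BN layer with mean $\mu$, standard deviation $\sigma$, and rescale and shift parameters $\alpha$ and $\beta$ as $\mathrm{BN}(X; \mu, \sigma, \alpha, \beta)$. The RepBN defined in the paper (equation label eq:repbn) can be re-parameterized as a single BN layer of this form; the lemma's displayed equation is not preserved in this extract.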
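A minimal sketch of such a re-parameterization, under an assumption not stated above: suppose RepBN adds a learnable scaled identity branch to BN, $\mathrm{RepBN}(X) = \mathrm{BN}(X; \mu, \sigma, \alpha, \beta) + \eta X$, where the symbol $\eta$ and this form are assumed for illustration. Writing $X = \sigma \cdot \frac{X - \mu}{\sigma} + \mu$ and collecting terms gives $\mathrm{BN}(X; \mu, \sigma, \alpha + \eta\sigma, \beta + \eta\mu)$. The code below checks that identity numerically for one channel; the function names are hypothetical.

```python
def bn(x, mu, sigma, alpha, beta):
    """BN(X; mu, sigma, alpha, beta) applied to one channel's values."""
    return [(v - mu) / sigma * alpha + beta for v in x]

def rep_bn(x, mu, sigma, alpha, beta, eta):
    """Assumed RepBN form: BN output plus a scaled identity branch eta * X."""
    return [y + eta * v for y, v in zip(bn(x, mu, sigma, alpha, beta), x)]

def reparameterize(mu, sigma, alpha, beta, eta):
    """Absorb the eta * X branch into the BN rescale and shift parameters."""
    return mu, sigma, alpha + eta * sigma, beta + eta * mu

# Usage: both paths should agree up to floating-point error.
xs = [-1.5, 0.0, 0.25, 3.0]
params = dict(mu=0.4, sigma=1.7, alpha=0.9, beta=-0.2)
direct = rep_bn(xs, eta=0.3, **params)
merged = bn(xs, *reparameterize(eta=0.3, **params))
```

Under this assumption the extra branch adds no inference cost, since the merged layer is an ordinary BN.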

Figures (5)

  • Figure 1: Comparisons of different methods on ImageNet.
  • Figure 2: The overall framework of our proposed Progressive Re-parameterized BatchNorm. (a) During training, we progressively replace LayerNorm with RepBN, a new re-parameterization formula of BatchNorm that further improves performance. (b) At inference, $\gamma=0$, so the transformer block transitions to a RepBN-based architecture, which can be further re-parameterized to BatchNorm and merged with linear layers.
  • Figure 3: Attention map ($196\times196$) from the 4th block of the model based on DeiT-T. (a) The attention map of DeiT-T is full-rank. (b) With the help of depth-wise convolution, linear attention in Flatten Transformer has a high rank. (c) When simplified linear attention and progressive re-parameterized BatchNorm are applied to the transformer, the model still keeps a high rank.
  • Figure 4: Comparisons of accuracy and throughput for different methods on ImageNet-1K.
  • Figure 5: Comparison of different normalization on COCO.
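The progressive replacement described in the Figure 2 caption can be sketched as a blend of the two normalization outputs, weighted by $\gamma$. The caption only states that $\gamma=0$ at inference; the linear decay from 1 to 0 used here, the blend formula, and the function names are assumptions made for illustration.

```python
def gamma_schedule(step, total_steps):
    """Assumed linear decay of gamma from 1 to 0, clamped at 0 afterwards."""
    return max(0.0, 1.0 - step / total_steps)

def prep_bn_blend(ln_out, repbn_out, gamma):
    """Blend LayerNorm and RepBN outputs: gamma * LN(X) + (1 - gamma) * RepBN(X)."""
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(ln_out, repbn_out)]

# Usage: early in training the LayerNorm path dominates; at gamma = 0 only
# the RepBN path remains, so LayerNorm can be dropped for inference.
halfway = prep_bn_blend([1.0, 2.0], [3.0, 4.0], gamma_schedule(50, 100))
final = prep_bn_blend([1.0, 2.0], [3.0, 4.0], gamma_schedule(100, 100))
```

Starting from the LayerNorm path is meant to keep early training stable, while the decay ensures the final model depends only on the BatchNorm-style path.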

Theorems & Definitions (2)

  • Lemma 4.1
  • proof