Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang, Yichun Yin, Lifeng Shang, Ming Zhang

Abstract

The attention mechanism has been the core component of modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, making it a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency, at the cost of a narrower receptive field. While existing efforts attempt to combine the benefits of both by building hybrid models, they often resort to static, heuristically designed alternating patterns that limit efficient allocation of computation across scenarios. In this paper, we propose Switch Attention (SwiAttn), a novel hybrid transformer that enables dynamic and fine-grained routing between full attention and sliding window attention. For each token at each transformer layer, SwiAttn dynamically routes the computation to either a full-attention branch for global information aggregation or a sliding-window branch for efficient local pattern matching. An adaptive regularization objective encourages the model toward efficient routing. Moreover, we adopt continual pretraining to optimize the model, transferring the full-attention architecture to the hybrid one. Extensive experiments on twenty-three benchmark datasets, at both regular (4K) and long (32K) context lengths, demonstrate the effectiveness of the proposed method.
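
To make the routing mechanism concrete, the PyTorch sketch below illustrates one possible reading of the abstract: a single layer computes a causal full-attention branch and a sliding-window branch, a per-token linear router mixes the two outputs, and a simple regularizer tracks how much probability mass falls on the full branch. The module name `SwitchAttentionSketch`, the soft mixing (rather than hard top-1 routing), all dimensions, and the regularizer form are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of per-token routing between full attention and
# sliding-window attention (PyTorch). Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchAttentionSketch(nn.Module):
    """One layer with two attention branches and a per-token router."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, window: int = 64):
        super().__init__()
        self.full_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.swa_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, 2)  # per-token logits: [full, swa]
        self.window = window

    def forward(self, x: torch.Tensor):
        _, T, _ = x.shape
        # Boolean masks: True means "not allowed to attend".
        causal_block = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        idx = torch.arange(T, device=x.device)
        too_far = (idx[:, None] - idx[None, :]) >= self.window
        swa_block = causal_block | too_far

        # Compute both branches (as during continual pretraining); at decode
        # time only the branch chosen by the router would be evaluated.
        full_out, _ = self.full_attn(x, x, x, attn_mask=causal_block, need_weights=False)
        swa_out, _ = self.swa_attn(x, x, x, attn_mask=swa_block, need_weights=False)

        gate = F.softmax(self.router(x), dim=-1)  # (B, T, 2) routing weights
        out = gate[..., :1] * full_out + gate[..., 1:] * swa_out
        # Simple efficiency regularizer: probability mass on the full branch.
        full_ratio = gate[..., 0].mean()
        return out, full_ratio


if __name__ == "__main__":
    layer = SwitchAttentionSketch()
    x = torch.randn(2, 128, 256)
    y, full_ratio = layer(x)
    print(y.shape, full_ratio.item())  # torch.Size([2, 128, 256]) and a scalar
```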

Paper Structure

This paper contains 24 sections, 7 equations, 9 figures, 4 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Conventional hybrid transformers (a) often adopt manually designed, static hybrid schemes. In contrast, the proposed SwiAttn (b) dynamically selects the type of attention for each input token at each transformer layer, enabling fine-grained allocation of computation.
  • Figure 2: During the continual pretraining stage (a), we compute both the full attention branch and the SWA branch separately. The router then selects, for each token's hidden representation, the attention output from one of the two branches; the selected output is processed by a feed-forward network (FFN layer). During the decoding stage (b), the router decides which branch performs the attention computation, and the full attention branch and the SWA branch share a unified KV cache for better efficiency (see the decode-time sketch after this figure list).
  • Figure 3: Needle-in-a-haystack retrieval accuracy in different context positions (depth percentage) across $32$K context lengths. The proposed SwiAttn achieves perfect retrieval accuracy with its dynamic and fine-grained hybrid attention mechanism.
  • Figure 4: Efficiency comparison of SwiAttn (ours) and the FullAttn baseline at different token positions. We report GFLOPs for the prefill stage and per-token average memory access for the decode stage.
  • Figure 5: The ratio of tokens routed to the full attention branch across different transformer layers. The average full attention ratio is around $0.13$.
  • ...and 4 more figures
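
Figure 2(b) describes decode-time behavior in which the router picks one branch per token and both branches read a unified KV cache. The snippet below is a minimal, hypothetical sketch of that idea for a single head: both branches share the same cached keys and values, and the sliding-window branch simply restricts itself to the most recent `window` entries. The function name `decode_step`, the shapes, and the omission of multi-head and positional-encoding details are simplifying assumptions.

```python
# Hypothetical single-head sketch of decode-time routing over a shared KV cache.
import torch


def decode_step(q, k_cache, v_cache, route_to_full: bool, window: int = 64):
    """One decode step: both branches read the same cache; the sliding-window
    branch only attends over the most recent `window` cached entries."""
    k, v = k_cache, v_cache                      # (T, d) each, shared cache
    if not route_to_full:                        # SWA branch: truncate context
        k, v = k[-window:], v[-window:]
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)    # (1, T') attention scores
    attn = torch.softmax(scores, dim=-1)
    return attn @ v                              # (1, d) attended output


if __name__ == "__main__":
    d = 64
    k_cache = torch.randn(500, d)                # keys cached so far
    v_cache = torch.randn(500, d)                # values cached so far
    q = torch.randn(1, d)                        # query for the new token
    out_full = decode_step(q, k_cache, v_cache, route_to_full=True)
    out_swa = decode_step(q, k_cache, v_cache, route_to_full=False)
    print(out_full.shape, out_swa.shape)         # both torch.Size([1, 64])
```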