SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber

TL;DR

SwitchHead introduces a Mixture-of-Experts approach to the Transformer attention layer, drastically reducing the number of attention matrices computed by allowing multiple expert projections for values and outputs while sharing keys and queries. By using a non-competitive sigmoid selection, SwitchHead computes attention with significantly fewer active heads and can be combined with MoE-based MLP layers to form SwitchAll. Empirical results across multiple datasets and model sizes show SwitchHead achieves perplexity comparable to parameter-matched dense Transformers with much lower compute and memory usage, and SwitchAll often surpasses baselines under the same parameter budgets. The work demonstrates stable training without extra regularizers and provides insights into attention map redundancy and interpretable expert selections, with practical implications for resource-constrained deployment and scalable language modeling.

Abstract

Despite many recent works on Mixture of Experts (MoEs) for resource-efficient Transformer language models, existing methods mostly focus on MoEs for feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that successfully reduces both the compute and memory requirements, achieving wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our novel MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% compute and 27% memory usage. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., achieving more than 3.5% absolute improvements on BLiMP compared to the baseline with equal compute resources.
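
To make the mechanism summarized above concrete, here is a minimal NumPy sketch of one SwitchHead-style attention layer. It is an illustration under stated assumptions, not the authors' implementation: the function and parameter names (`switchhead_layer`, `W_sel_v`, `W_sel_o`, `top_k`), the top-k selection rule, the causal mask, and all shapes are hypothetical choices made for this example. The summary only states that keys and queries are shared per head, that values and outputs use multiple experts, and that selection uses a non-competitive sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def topk_gate(scores, top_k):
    """Keep the top_k largest scores in each row and zero the rest.
    Scores are NOT renormalized: each sigmoid score stands on its own."""
    idx = np.argsort(scores, axis=-1)[:, -top_k:]
    gate = np.zeros_like(scores)
    np.put_along_axis(gate, idx, np.take_along_axis(scores, idx, axis=-1), axis=-1)
    return gate

def init_params(rng, n_heads, n_experts, d_model, d_head):
    """Random weights for illustration only (hypothetical layout)."""
    def w(*shape):
        return rng.normal(scale=d_model ** -0.5, size=shape)
    return [
        {
            "W_q": w(d_model, d_head),             # one query projection per head
            "W_k": w(d_model, d_head),             # one key projection per head
            "W_v": w(n_experts, d_model, d_head),  # value experts
            "W_o": w(n_experts, d_head, d_model),  # output experts
            "W_sel_v": w(d_model, n_experts),      # value-expert selector
            "W_sel_o": w(d_model, n_experts),      # output-expert selector
        }
        for _ in range(n_heads)
    ]

def switchhead_layer(x, params, top_k):
    """x: (T, d_model) -> (T, d_model). One attention matrix per head."""
    T, d_model = x.shape
    causal = np.tril(np.ones((T, T), dtype=bool))  # assumption: causal LM setting
    y = np.zeros((T, d_model))
    for p in params:
        q = x @ p["W_q"]
        key = x @ p["W_k"]
        logits = q @ key.T / np.sqrt(q.shape[-1])
        attn = softmax(np.where(causal, logits, -np.inf))  # (T, T)

        g_v = topk_gate(sigmoid(x @ p["W_sel_v"]), top_k)  # (T, E)
        g_o = topk_gate(sigmoid(x @ p["W_sel_o"]), top_k)  # (T, E)

        # Dense einsum for clarity: unselected experts have gate 0.
        # An efficient version would compute only the selected experts.
        v = np.einsum("te,td,edh->th", g_v, x, p["W_v"])   # (T, d_head)
        ctx = attn @ v                                     # (T, d_head)
        y += np.einsum("te,th,ehd->td", g_o, ctx, p["W_o"])
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
params = init_params(rng, n_heads=2, n_experts=4, d_model=16, d_head=8)
y = switchhead_layer(x, params, top_k=2)
print(y.shape)  # (6, 16)
```

In this sketch the number of attention matrices equals the number of heads (two here), while the value and output projections draw on `n_experts` parameter sets per head. The savings described in the abstract come from needing far fewer heads than a dense Transformer of the same parameter count.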

Paper Structure (25 sections, 9 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: A schematic representation of SwitchHead. It consists of a few independent heads, each with multiple experts for value and output projections. Each head has a single attention matrix.
  • Figure 2: An attention map of the (a) standard Transformer and (b) SwitchHead. The maximum over all heads in the given layer is shown.
  • Figure 3: The maximum of all attention maps for a SwitchHead model on ListOps.
  • Figure 4: The maximum of all attention maps for a standard Transformer model on ListOps.
  • Figure 5: Details for individual heads of the SwitchHead model on ListOps. On the left side of each attention plot, the selection of the output projection expert is shown. Similarly, at the bottom, the selection of the value projection expert is visible. In the selection maps, dark blue always corresponds to 1, while white is 0. The adaptive scale shown to the right of the attention map is for the map only.
  • ...and 1 more figure
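
Several captions above describe plots of "the maximum of all heads" or "all attention maps" in a layer. One plausible reading, sketched below, is an elementwise maximum across the head dimension. The function name `max_over_heads` and the toy values are assumptions for illustration; the captions do not specify how the plots were produced.

```python
import numpy as np

def max_over_heads(attn):
    """attn: (n_heads, T, T) attention weights of one layer.
    Returns a (T, T) map holding, per query/key pair, the largest
    weight that any head assigns to it."""
    attn = np.asarray(attn)
    return attn.max(axis=0)

# Two toy heads over a sequence of length 2 (each row sums to 1).
attn = np.array([
    [[1.0, 0.0], [0.3, 0.7]],
    [[1.0, 0.0], [0.9, 0.1]],
])
summary = max_over_heads(attn)
print(summary)
# [[1.  0. ]
#  [0.9 0.7]]
```

Rows of the resulting map need not sum to 1, so it is a visualization aid that shows where any head attends, not an attention distribution in its own right.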