Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers

Anrui Chen, Ruijun Huang, Xin Zhang, Fang Dong, Hengjie Cao, Zhendong Huang, Yifeng Yang, Mengyi Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Li Shang

TL;DR

The paper shows that, in MoE Transformers, the post-attention router input simultaneously encodes multiple separately decodable semantic and structural factors with uneven head support, and that different feature compositions induce weakly aligned parameter-gradient directions. As a result, routing maps many distinct compositions to the same route.

Abstract

Mixture-of-Experts (MoE) architectures are often considered a natural fit for continual learning because sparse routing should localize updates and reduce interference, yet MoE Transformers still forget substantially even with sparse, well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act on co-occurring feature compositions rather than separable head channels. We show that this router input simultaneously encodes multiple separately decodable semantic and structural factors with uneven head support, and that different feature compositions induce weakly aligned parameter-gradient directions; as a result, routing maps many distinct compositions to the same route. We quantify this collision effect via a route-wise effective composition number $N_{\mathrm{eff}}$ and find that higher $N_{\mathrm{eff}}$ is associated with larger old-task loss increases after continual training. Motivated by these findings, we propose MH-MoE, which performs head-wise routing over sub-representations to increase routing granularity and reduce composition collisions. On TRACE with Qwen3-0.6B/8B, MH-MoE effectively mitigates forgetting, reducing BWT on Qwen3-0.6B from 11.2% (LoRAMoE) to 4.5%.
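The contrast between routing on the concatenated post-attention vector and routing head-wise on sub-representations can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the top-1 linear routers, the equal-sized split, and the hand-picked weights are hypothetical and are not taken from the MH-MoE implementation, whose details are not given in this summary.

```python
def split_heads(h, num_heads):
    """Split a flat hidden vector into num_heads equal-sized sub-vectors."""
    if len(h) % num_heads != 0:
        raise ValueError("hidden size must be divisible by num_heads")
    d = len(h) // num_heads
    return [h[i * d:(i + 1) * d] for i in range(num_heads)]


def top1_expert(x, router):
    """Index of the router row with the largest dot product with x.

    Ties resolve to the lowest index. `router` is a list of weight rows,
    one row per expert (a hypothetical linear router).
    """
    scores = [sum(w * v for w, v in zip(row, x)) for row in router]
    return max(range(len(scores)), key=scores.__getitem__)


def route_token(h, router):
    """Token-level routing: one decision on the whole concatenated vector."""
    return top1_expert(h, router)


def route_head_wise(h, head_routers):
    """Head-wise routing: one decision per sub-representation."""
    subs = split_heads(h, len(head_routers))
    return tuple(top1_expert(s, r) for s, r in zip(subs, head_routers))


# Toy setup (assumed): 2 heads of width 2, 2 experts.
TOKEN_ROUTER = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]
HEAD_ROUTERS = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]

# Two inputs that differ only in the first head's sub-vector.
h_a = [1.0, 0.0, 0.0, 1.0]
h_b = [0.0, 1.0, 0.0, 1.0]

if __name__ == "__main__":
    print("token-level:", route_token(h_a, TOKEN_ROUTER), route_token(h_b, TOKEN_ROUTER))
    print("head-wise:  ", route_head_wise(h_a, HEAD_ROUTERS), route_head_wise(h_b, HEAD_ROUTERS))
```

In this toy case the two inputs share a single token-level route but receive different head-wise route tuples. With $E$ experts and $H$ heads, top-1 token-level routing has $E$ possible routes, while top-1 head-wise routing has up to $E^H$ route tuples, which is one way to read "increased routing granularity".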

Paper Structure (44 sections, 2 theorems, 46 equations, 8 figures, 5 tables, 1 algorithm)

Key Result

Lemma 2.1

Fix a route $r$ with composition distribution $p(c\mid r)$ over $c\in\mathcal{C}$, and define [displayed equation omitted]. Then for any $S\subseteq\mathcal{C}$ with $|S|\le m$, [displayed equation omitted].
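The definition of the effective composition number and the bound itself are not reproduced in this summary. As a hedged illustration only, the sketch below assumes the inverse-Simpson form $N_{\mathrm{eff}}(r) = 1/\sum_{c} p(c\mid r)^2$, which may differ from the paper's definition. Under that assumption, the Cauchy-Schwarz inequality gives $\sum_{c\in S} p(c\mid r) \le \sqrt{m/N_{\mathrm{eff}}(r)}$ for any $S$ with $|S|\le m$, which is one possible reading of a "mixing mass bound". The function names are hypothetical.

```python
import math
from collections import Counter


def composition_distribution(labels):
    """Empirical p(c | r) from the composition labels of tokens sent to one route."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def effective_number(p):
    """Inverse-Simpson effective number 1 / sum_c p(c)^2 (assumed definition)."""
    return 1.0 / sum(q * q for q in p.values())


def top_m_mass(p, m):
    """Largest probability mass that any set of at most m compositions can hold."""
    return sum(sorted(p.values(), reverse=True)[:m])


def cauchy_schwarz_cap(p, m):
    """Upper bound sqrt(m / N_eff) on the mass of any m compositions."""
    return math.sqrt(m / effective_number(p))


if __name__ == "__main__":
    # Hypothetical route that receives three compositions with skewed frequencies.
    p = composition_distribution(["a"] * 6 + ["b"] * 3 + ["c"])
    for m in (1, 2, 3):
        print(m, top_m_mass(p, m), cauchy_schwarz_cap(p, m))
```

Under this assumed definition, a route that sees $k$ equally likely compositions has an effective number of exactly $k$, and a route dominated by a single composition has a value close to 1.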

Figures (8)

  • Figure 1: The router input multiplexes multiple decodable features. (a) Linear probes trained on post-attention states $h_t^{(\ell)}$ predict domain/stance (semantic) and frequency/position (structural) well above chance across layers, showing that these signals co-exist in the same vector. (b) Overlap between probe-induced decoding subspaces is small, indicating that the co-existing signals occupy largely distinct linear directions within $h_t^{(\ell)}$.
  • Figure 2: Feature signals are head-structured but appear mixed in the router input. For each feature $Y$, we estimate head-wise causal importance by ablating one head and measuring the resulting drop in probe accuracy on $h_t^{(\ell)}$. Importance is highly non-uniform across heads and differs by feature, suggesting that feature signals originate in specific heads but are multiplexed in the post-attention router input.
  • Figure 3: Different feature compositions induce distinct gradient directions. Histogram of cosine similarity between composition-conditioned mean gradient directions (Eq. \ref{eq:mean_dir}--\ref{eq:comp_cos}). Splits of the same composition show high agreement, whereas different compositions concentrate near zero similarity, indicating weak alignment between their learning signals.
  • Figure 4: Composition mixing persists across layers. Old-task mass-weighted average effective composition number $N_{\mathrm{eff}}$ (Eq. \ref{eq:neff}) across routes in each MoE layer. Values substantially above $1$ indicate that routes typically aggregate multiple feature compositions under the old-task distribution.
  • Figure 5: More mixed routes forget more. Route-wise old-task loss increase $\Delta L_{\mathrm{old}}$ versus effective composition number $N_{\mathrm{eff}}$ (Eq. \ref{eq:neff}). Routes are binned by mass-quantiles of $N_{\mathrm{eff}}$ under old-task routing exposure $\mathrm{mass}_{\mathrm{old}}(r)$, so each bin contains comparable old-task token mass. Points report mean $\Delta L_{\mathrm{old}}$ with standard error, showing a positive association between mixing and forgetting.
  • ...and 3 more figures
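The comparison described in the Figure 3 caption can be sketched as follows: group per-example gradients by feature composition, average their unit directions within each group, and take the cosine similarity between group means. This is a minimal sketch under assumptions: the referenced equations are not reproduced in this summary, so normalizing each gradient before averaging is a guess at the "mean gradient direction", and the function names and toy vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_direction(grads):
    """Average of unit-normalized gradients for one composition (assumed form)."""
    units = [normalize(g) for g in grads]
    dim = len(units[0])
    return [sum(u[i] for u in units) / len(units) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy gradients (made up): two compositions pulling in orthogonal directions.
    comp_a = [[1.0, 0.0], [2.0, 0.0]]
    comp_b = [[0.0, 3.0], [0.0, 1.0]]
    # Two splits of the same composition agree; different compositions do not.
    print("same composition:", cosine(mean_direction(comp_a[:1]), mean_direction(comp_a[1:])))
    print("different compositions:", cosine(mean_direction(comp_a), mean_direction(comp_b)))
```

In practice the gradient vectors would be flattened parameter gradients from the model, and the histogram would pool many composition pairs. The toy vectors only show the two regimes the caption describes.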

Theorems & Definitions (4)

  • Lemma 2.1: Mixing mass bound
  • Theorem 2.2: Composition mixing increases forgetting susceptibility
  • proof: Proof of Lemma 2.1
  • proof: Proof of Theorem 2.2