Grouped Differential Attention

Junghwan Lim, Sungmin Lee, Dongseok Kim, Wai Ting Cheung, Beomgyu Kim, Taehwan Kim, Haesol Lee, Junhyeok Lee, Dongpin Oh, Eunhwan Park

TL;DR

Grouped Differential Attention (GDA) addresses the inefficiency of the symmetric (balanced) head allocation used by Differential Attention. It allocates more heads to signal-preserving computation and fewer to noise control, and stabilizes the latter by repeating the noise-control heads, akin to Grouped Query Attention (GQA). It further introduces group-differentiated growth, which scales capacity by selectively expanding only the signal-focused heads. In large-scale pretraining and progressive continual training, moderate imbalance ratios such as $3:1$ or $4:1$ improve generalization and training stability under a fixed FLOPs budget compared with symmetric baselines. These findings offer a practical, scalable path toward computation-efficient Transformers with enhanced signal fidelity and noise suppression.

Abstract

The self-attention mechanism, while foundational to modern Transformer architectures, suffers from a critical inefficiency: it frequently allocates substantial attention to redundant or noisy context. Differential Attention addressed this by using subtractive attention maps for signal and noise, but its required balanced head allocation imposes rigid constraints on representational flexibility and scalability. To overcome this, we propose Grouped Differential Attention (GDA), a novel approach that introduces unbalanced head allocation between signal-preserving and noise-control groups. GDA significantly enhances signal focus by strategically assigning more heads to signal extraction and fewer to noise-control, stabilizing the latter through controlled repetition (akin to GQA). This design achieves stronger signal fidelity with minimal computational overhead. We further extend this principle to group-differentiated growth, a scalable strategy that selectively replicates only the signal-focused heads, thereby ensuring efficient capacity expansion. Through large-scale pretraining and continual training experiments, we demonstrate that moderate imbalance ratios in GDA yield substantial improvements in generalization and stability compared to symmetric baselines. Our results collectively establish that ratio-aware head allocation and selective expansion offer an effective and practical path toward designing scalable, computation-efficient Transformer architectures.
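The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the function name and tensor layout are hypothetical, `lam` is a fixed scalar weight on the noise map, no causal mask is applied, and the noise attention maps are shared by simple repetition along the head axis in the way GQA repeats key/value heads.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_differential_attention(q_sig, k_sig, q_noise, k_noise, v, lam=0.5):
    """Hypothetical sketch of attention with unbalanced signal/noise heads.

    q_sig, k_sig:     (S, T, d) projections for S signal heads
    q_noise, k_noise: (N, T, d) projections for N noise heads, with S % N == 0
    v:                (S, T, d_v) values, one set per signal head
    lam:              assumed fixed scalar weight on the noise map
    """
    n_sig, _, d = q_sig.shape
    n_noise = q_noise.shape[0]
    if n_sig % n_noise != 0:
        raise ValueError("number of signal heads must be a multiple of noise heads")

    scale = 1.0 / np.sqrt(d)
    # One attention map per signal head: (S, T, T)
    a_sig = softmax((q_sig @ np.swapaxes(k_sig, -1, -2)) * scale)
    # One attention map per noise head: (N, T, T)
    a_noise = softmax((q_noise @ np.swapaxes(k_noise, -1, -2)) * scale)
    # Share each noise map across S // N signal heads (GQA-style repetition)
    a_noise = np.repeat(a_noise, n_sig // n_noise, axis=0)
    # Subtract the noise map from the signal map, then weight the values
    return (a_sig - lam * a_noise) @ v


# Usage: a 3:1 split with 6 signal heads and 2 noise heads
rng = np.random.default_rng(0)
S, N, T, d = 6, 2, 5, 4
q_sig, k_sig = rng.normal(size=(S, T, d)), rng.normal(size=(S, T, d))
q_noise, k_noise = rng.normal(size=(N, T, d)), rng.normal(size=(N, T, d))
v = rng.normal(size=(S, T, d))
out = grouped_differential_attention(q_sig, k_sig, q_noise, k_noise, v)
```

Because only `N` noise maps are computed and then repeated, the cost of the noise branch shrinks as the ratio `S:N` grows, which is the trade-off the abstract describes.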

Paper Structure

This paper contains 17 sections, 9 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of Grouped Differential Attention (GDA). Unlike the symmetric Differential Transformer, GDA allocates heads unevenly across the signal and noise groups, sharing a smaller set of noise-control heads among multiple query groups.
  • Figure 2: Comparison of group-differentiated allocation and performance. (a) Signal and noise head allocation under different $G{:}1$ ratios for $H=48$, where $(G+1)$ divides $H$. (b) Relative performance change (percentage $\Delta$) compared to the $1{:}1$ baseline.
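The divisibility condition in the Figure 2 caption can be made concrete with a short sketch. It assumes that a $G{:}1$ ratio splits the $H$ heads into $G$ parts signal and one part noise, which is one reading of the caption rather than something stated outright. The function names are hypothetical.

```python
def gda_head_split(total_heads: int, ratio: int) -> tuple[int, int]:
    """Return (signal_heads, noise_heads) for a ratio:1 split of total_heads.

    Assumes the heads are divided into (ratio + 1) equal parts, with one
    part assigned to noise control and the rest to signal.
    """
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    if total_heads % (ratio + 1) != 0:
        raise ValueError("(ratio + 1) must divide total_heads")
    noise_heads = total_heads // (ratio + 1)
    signal_heads = total_heads - noise_heads
    return signal_heads, noise_heads


def valid_ratios(total_heads: int) -> list[int]:
    """All G for which a G:1 split of total_heads is possible."""
    return [g for g in range(1, total_heads) if total_heads % (g + 1) == 0]


# For H = 48: list each admissible ratio with its (signal, noise) head counts
for g in valid_ratios(48):
    print(f"{g}:1 ->", gda_head_split(48, g))
```

Under this reading, $H=48$ admits a $3{:}1$ split (36 signal heads, 12 noise heads) but not a $4{:}1$ split, since 5 does not divide 48.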