Understanding Differential Transformer Unchains Pretrained Self-Attentions

Chaerin Kong, Jiho Jang, Nojun Kwak

TL;DR

This work analyzes Differential Transformer and attributes its success to three factors: enhanced expressivity via negative attention, reduced head redundancy, and improved learning dynamics. Based on these insights, it introduces DEX, a lightweight architectural extension that reuses pretrained softmax attention scores and applies a differential operation on the output value matrix, with selective head adaptation and lambda annealing to preserve pretrained knowledge. DEX can be integrated into multiple pretrained LLM families with minimal data (<1B tokens) and negligible test-time overhead, yielding significant gains across language modeling, information retrieval, in-context learning, and instruction tuning. The results suggest that DEX is a practical way to upgrade existing pretrained models with the benefits of differential attention while maintaining stability and scalability.

Abstract

Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise-canceled attention. However, precisely how differential attention achieves its empirical benefits remains poorly understood. Moreover, the Differential Transformer architecture demands large-scale training from scratch, hindering the use of open pretrained weights. In this work, we conduct an in-depth investigation of Differential Transformer, uncovering three key factors behind its success: (1) enhanced expressivity via negative attention, (2) reduced redundancy among attention heads, and (3) improved learning dynamics. Based on these findings, we propose DEX, a novel method to efficiently integrate the advantages of differential attention into pretrained language models. By reusing the softmax attention scores and adding a lightweight differential operation on the output value matrix, DEX effectively incorporates the key advantages of differential attention while remaining lightweight in both training and inference. Evaluations confirm that DEX substantially improves pretrained LLMs across diverse benchmarks, achieving significant performance gains with minimal adaptation data (< 0.01%).
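To make "negative attention" concrete, the following is a minimal sketch of the differential attention operation as it is commonly written for Differential Transformer: the difference of two softmax attention maps, scaled by a scalar lambda, applied to the values. It is not the DEX method itself, whose exact formulation is not given in this section. The function names, the toy inputs, and the fixed `lam=0.8` are assumptions for illustration, and causal masking, per-head normalization, and the learnable parameterization of lambda are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)) for list-of-lists Q and K."""
    d = len(Q[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]

def diff_attention(Q1, K1, Q2, K2, V, lam):
    """Compute (softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V.

    Simplified sketch: no causal mask, no head normalization.
    Returns both the signed score matrix and the output.
    """
    A1 = attention_scores(Q1, K1)
    A2 = attention_scores(Q2, K2)
    scores = [[a1 - lam * a2 for a1, a2 in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    out = [
        [sum(s * v[j] for s, v in zip(row, V)) for j in range(len(V[0]))]
        for row in scores
    ]
    return scores, out

# Hypothetical toy inputs: 2 query tokens, 3 key tokens, head dimension 2.
Q1 = [[1.0, 0.0], [0.0, 1.0]]
K1 = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
Q2 = [[0.0, 1.0], [1.0, 0.0]]
K2 = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

scores, out = diff_attention(Q1, K1, Q2, K2, V, lam=0.8)
for row in scores:
    print([round(s, 3) for s in row])
```

Because each softmax row sums to 1, every row of the differential scores sums to `1 - lam`, and individual entries can be negative wherever the second map outweighs the first. A single softmax map cannot assign negative weight to a token.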

Paper Structure

This paper contains 45 sections, 6 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: Attention score comparison between the two groups in Diff attention. Top 5% refers to the 5% of tokens with the highest attention scores in each sequence. The overlap between the two groups' attention scores is much greater for non-salient tokens.
  • Figure 2: (a), (b): Ratio of attention scores whose absolute value is smaller than $\epsilon$. Except for the bottom layers, Diff Transformer displays a lower sparsity ratio. (c), (d): Attention score entropy. Entropy in (c) measures magnitude concentration, calculated on renormalized absolute values of the final differential attention scores. Group refers to the two separate attentions in Diff.
  • Figure 3:
  • Figure 4: Attention scores on Indirect Object Identification (IOI, top two) and sarcastic expression (bottom two). Blue indicates negative and red represents positive. Green boxes highlight the difference.
  • Figure 5: (Left) Pairwise cosine distance between per-head attention scores (flattened across layers). Brighter indicates larger distance, hence less redundancy. (Right) CKA (nguyen2020wide) between per-head features. Brighter means higher alignment, hence higher redundancy. See the appendix (ref. supp:2.2).
  • ...and 14 more figures
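
The quantities named in the captions of Figures 2 and 5 can be sketched in a few lines. This is an illustrative reading of the captions only. The function names, the toy score matrices, and the threshold `eps=0.05` are assumptions, and the paper's exact definitions (for example how scores are aggregated across layers) may differ.

```python
import math

def sparsity_ratio(scores, eps):
    """Fraction of attention scores whose absolute value is below eps."""
    flat = [s for row in scores for s in row]
    return sum(abs(s) < eps for s in flat) / len(flat)

def magnitude_entropy(row):
    """Shannon entropy (nats) of |row| renormalized to sum to 1.

    Taking absolute values first lets the measure apply to signed
    differential scores; lower entropy means more concentrated magnitude.
    """
    mags = [abs(s) for s in row]
    total = sum(mags)
    if total == 0:
        return 0.0
    probs = [m / total for m in mags if m > 0]
    return -sum(p * math.log(p) for p in probs)

def cosine_distance(a, b):
    """1 - cosine similarity between two flattened score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Hypothetical signed attention scores for two heads (2 queries x 3 keys).
head_a = [[0.54, -0.37, 0.03], [-0.37, 0.54, 0.03]]
head_b = [[0.10, 0.05, 0.05], [0.05, 0.10, 0.05]]

flat_a = [s for row in head_a for s in row]
flat_b = [s for row in head_b for s in row]

ratio = sparsity_ratio(head_a, eps=0.05)
entropy = magnitude_entropy(head_a[0])
distance = cosine_distance(flat_a, flat_b)
print(ratio, entropy, distance)
```

A larger pairwise cosine distance between two heads' flattened scores means the heads attend differently, which is the sense in which Figure 5 reads brighter cells as less redundancy.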