Differential Transformer

Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei

TL;DR

The paper tackles attention noise in Transformer architectures by introducing Diff Transformer, which uses differential attention that subtracts two softmax maps to emphasize relevant context. This approach, implemented in a multi-head framework with a learnable lambda and normalization strategies, yields improved data efficiency, long-context utilization, key information retrieval, and reduced contextual hallucinations, while also enhancing robustness in in-context learning and lowering activation outliers. Extensive experiments demonstrate favorable scaling properties and strong performance across language modeling and downstream tasks, including 64K context lengths. The method remains compatible with existing training pipelines and hardware accelerators, and it opens avenues for low-bit attention kernels and cache compression due to sparser attention patterns.

Abstract

Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which has been considered a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
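
The core operation described in the abstract, taking the difference between two softmax attention maps, can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation: the function names, the toy inputs, and the use of a fixed scalar `lam` in place of the learnable $\lambda$ are assumptions made for illustration, and causal masking is omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)) for list-of-lists Q and K."""
    d = len(Q[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]

def diff_attention(Q1, K1, Q2, K2, V, lam):
    """Differential attention: (map1 - lam * map2) applied to V.

    `lam` is a plain float here (assumption); in the paper it is learnable.
    Returns the differential scores and the resulting output vectors.
    """
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    scores = [
        [a1 - lam * a2 for a1, a2 in zip(r1, r2)]
        for r1, r2 in zip(A1, A2)
    ]
    out = [
        [sum(s * v[j] for s, v in zip(row, V)) for j in range(len(V[0]))]
        for row in scores
    ]
    return scores, out

# Toy inputs (made up for illustration): one query, three keys, 2-dim heads.
Q1 = [[1.0, 0.0]]
K1 = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # first map favors key 0
Q2 = [[1.0, 0.0]]
K2 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # second map is uniform
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

scores, out = diff_attention(Q1, K1, Q2, K2, V, lam=0.8)
```

In this toy case the second map spreads its weight evenly, so subtracting it pushes the scores of the two uninformative keys below zero while the score of the favored key stays positive. Each row of the differential scores sums to `1 - lam` rather than 1, because both softmax maps sum to 1 per row.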

Paper Structure

This paper contains 22 sections, 8 equations, 11 figures, and 10 tables.

Figures (11)

  • Figure 1: Transformer often over-attends to irrelevant context (i.e., attention noise). Diff Transformer amplifies attention to answer spans and cancels noise, enhancing the capability of context modeling.
  • Figure 2: Multi-head differential attention. Each head takes the difference between two $\mathrm{softmax}$ attention maps to cancel out attention noise. $\lambda$ is a learnable scalar that is initialized to $\lambda_\text{init}$. $\operatorname{GroupNorm}$ applies normalization to each head independently. A fixed multiplier $(1 - \lambda_{\text{init}})$ is used after $\operatorname{GroupNorm}$, which aligns the gradient flow with Transformer. The code implementation is available at https://aka.ms/Diff-Transformer.
  • Figure 3: Language modeling loss of scaling up parameter count and training tokens. Diff Transformer requires only about 65% of model size or training tokens to match Transformer's performance.
  • Figure 4: Cumulative average negative log-likelihood (lower is better) on book data. Diff Transformer leverages long context more effectively.
  • Figure 5: Multi-needle retrieval results at 64K context length.
  • ...and 6 more figures
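
The head-combination step described in the Figure 2 caption (normalize each head independently, then scale by the fixed multiplier $(1 - \lambda_{\text{init}})$) can be sketched as follows. This is a hedged illustration, not the released code: `rms_normalize` and `combine_heads` are hypothetical names, the RMS-style normalization without learnable gain is an assumed stand-in for the per-head $\operatorname{GroupNorm}$, and the final output projection is omitted.

```python
import math

def rms_normalize(vec, eps=1e-5):
    """Scale a vector to roughly unit root-mean-square (no learnable gain)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def combine_heads(head_outputs, lam_init):
    """Normalize each head per token, scale by (1 - lam_init), concatenate.

    head_outputs: list over heads; each head is a list over tokens of
    per-head output vectors (for example, from differential attention).
    Returns one concatenated vector per token.
    """
    scale = 1.0 - lam_init
    n_tokens = len(head_outputs[0])
    combined = []
    for t in range(n_tokens):
        token_vec = []
        for head in head_outputs:
            # Each head is normalized on its own, so heads with very
            # different output magnitudes contribute on the same scale.
            token_vec.extend(scale * x for x in rms_normalize(head[t]))
        combined.append(token_vec)
    return combined

# Two hypothetical heads whose outputs differ in scale by a factor of 100.
heads = [
    [[1.0, 2.0]],      # head 0, one token
    [[100.0, 200.0]],  # head 1, one token
]
combined = combine_heads(heads, lam_init=0.8)
```

Because normalization is applied per head, the two halves of the concatenated vector end up with nearly identical values even though the raw head outputs differ by two orders of magnitude. The fixed `(1 - lam_init)` factor then sets the overall scale of each head's contribution.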