Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee

TL;DR

Affine-Scaled Attention is a simple extension to standard attention that applies an input-dependent scale and a corresponding bias term to softmax-normalized attention weights, allowing the model to adjust both the relative distribution and the overall scale of attention in a controlled manner.

Abstract

Transformer attention is typically implemented using softmax normalization, which constrains the attention weights for each query to sum to one. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while still aggregating value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.
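The abstract describes the mechanism only in words, so the following is a minimal NumPy sketch of one way an affine rescaling of softmax weights could look, not the authors' formulation. The function name `affine_scaled_attention`, the sigmoid parameterization of the per-query scale `alpha`, the projection vector `w_alpha`, and the scalar bias `beta` are all assumptions made for illustration; causal masking and multiple heads are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def affine_scaled_attention(q, k, v, x, w_alpha, beta):
    """Single-head attention with an affine transform on the softmax weights.

    q, k, v : (n, d) query, key, and value matrices
    x       : (n, d_model) token representations used to compute the scale
    w_alpha : (d_model,) hypothetical projection for the per-query scale
    beta    : hypothetical scalar bias added to every attention weight
    """
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    p = softmax(logits, axis=-1)                 # each row sums to 1

    # Assumed parameterization: one input-dependent scale per query in (0, 1).
    alpha = 1.0 / (1.0 + np.exp(-(x @ w_alpha)))  # shape (n,)

    # Affine transform of the normalized weights; rows no longer sum to 1.
    weights = alpha[:, None] * p + beta
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4, 8
    q, k, v, x = (rng.normal(size=(n, d)) for _ in range(4))
    out, weights = affine_scaled_attention(q, k, v, x, rng.normal(size=d), 0.05)
    print("output shape:", out.shape)
    print("per-query weight sums:", weights.sum(axis=-1))
```

Under these assumptions, each query's weights sum to `alpha + n * beta` rather than to one, which is the sense in which the unit-sum constraint is relaxed while the output remains a weighted aggregation of value vectors.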

Paper Structure

This paper contains 33 sections, 11 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Comparison of baseline softmax, attention sink, and Affine-Scaled Attention. Attention skew toward the first token (red) is progressively reduced. Here, $i$ denotes the index over key tokens, and the attention weight distributions are illustrated for a single query.
  • Figure 2: Training dynamics of attention logits. Mean $QK^T$ logit value over training steps for the 3B baseline and attention sink models, averaged across layers, heads, and token positions.
  • Figure 3: Per-query attention weight sum distributions for (a) Attention sink and (b) Affine-Scaled Attention. Dashed lines indicate the mean and median.
  • Figure 4: Per-layer first-token attention weights for the 1B model.
  • Figure 5: Layer-wise and head-wise heatmaps for the 3B model showing effective attention reweighting across methods.
  • ...and 6 more figures