Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Jeongin Bae; Baeseong Park; Gunho Park; Minsub Kim; Joonhyung Lee; Junhee Yoo; Sunghyeon Woo; Jiwon Ryu; Se Jung Kwon; Dongsoo Lee

TL;DR

Affine-Scaled Attention is a simple extension to standard attention that applies an input-dependent scale and a corresponding bias term to softmax-normalized attention weights, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner.

Abstract

Transformer attention is typically implemented using softmax normalization, which constrains the attention weights for each query to sum to one. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that applies an input-dependent scale and a corresponding bias term to softmax-normalized attention weights. This design relaxes the strict normalization constraint while still aggregating value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.
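
The abstract describes the mechanism only in words, so the following is a minimal NumPy sketch of one way an input-dependent scale and bias could be applied to softmax weights. It is not the authors' formulation: the per-query scale `alpha` (a sigmoid of a query projection), the per-query bias `beta` (a tanh of a query projection divided by the number of keys), and the projection vectors `w_alpha` and `w_beta` are all assumptions made for illustration. Causal masking and multiple heads are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: each slice along `axis` sums to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def affine_scaled_attention(q, k, v, w_alpha, w_beta):
    """Single-head, unmasked attention with an affine map on the softmax weights.

    q: (Tq, d), k: (Tk, d), v: (Tk, dv).
    w_alpha, w_beta: (d,) hypothetical projection vectors (assumed, not from the paper).
    Returns the attention output (Tq, dv) and the reweighted attention matrix (Tq, Tk).
    """
    d = q.shape[-1]
    n_keys = k.shape[0]

    # Standard softmax attention weights; every row sums to 1.
    p = softmax(q @ k.T / np.sqrt(d), axis=-1)

    # Assumed input-dependent scale in (0, 1), one value per query.
    alpha = 1.0 / (1.0 + np.exp(-(q @ w_alpha)))

    # Assumed input-dependent bias, one value per query, shared across keys.
    beta = np.tanh(q @ w_beta) / n_keys

    # Affine reweighting: row sums become alpha + n_keys * beta instead of 1.
    weights = alpha[:, None] * p + beta[:, None]
    return weights @ v, weights
```

Under these assumptions the per-query weight sum equals `alpha + n_keys * beta`, so it is no longer pinned to one; with zero projection vectors it is 0.5 for every query.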

Paper Structure (33 sections, 11 equations, 11 figures, 5 tables)

Introduction
Background
LLM training
Softmax attention
Attention sink
Gated Attention
Methodology
Motivation
Affine-Scaled Attention
Attention allocation dynamics
Attention weights
First token analysis
Head-wise allocation dynamics
Attention entropy
Experiments
...and 18 more sections

Figures (11)

Figure 1: Comparison of baseline softmax, attention sink, and Affine-Scaled Attention. Attention skew toward the first token (red) is progressively reduced. Here, $i$ denotes the index over key tokens, and the attention weight distributions are illustrated for a single query.
Figure 2: Training dynamics of attention logits. $QK^T$ logits mean over training steps for the 3B baseline and attention sink models, averaged across layers, heads, and token positions.
Figure 3: Per-query attention weight sum distributions for (a) Attention sink and (b) Affine-Scaled Attention. Dashed lines indicate the mean and median.
Figure 4: Per-layer first-token attention weights for the 1B model.
Figure 5: Layer-wise and head-wise heatmaps for the 3B model showing effective attention reweighting across methods.
...and 6 more figures
