Scalable-Softmax Is Superior for Attention

Ken M. Nakanishi

TL;DR

Softmax attention suffers from attention fading: as the context length grows, the attention distribution flattens, which hinders long-context capabilities. The authors propose Scalable-Softmax (SSMax), which computes attention scores with a size-aware base, $z_i \mapsto \frac{n^{sz_i}}{\sum_j n^{sz_j}} = \frac{e^{(s\log n)z_i}}{\sum_j e^{(s\log n)z_j}}$, where $n$ is the input vector size and $s$ is a per-head, per-layer scaling parameter. This lets the model keep its focus on key tokens across varying context sizes. In experiments with a 162M-parameter Transformer, SSMax yields faster pretraining loss reduction, stronger generalization to longer contexts, and better key-information retrieval. The gains are largest when SSMax is used from the start of pretraining, but introducing it during or after pretraining still helps. SSMax requires only minimal changes to integrate, which makes it a practical replacement for Softmax in Transformer attention when length generalization and long-text retrieval matter.

Abstract

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.
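
The flattening described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it uses the SSMax formula from the TL;DR and the input from the Figure 1 caption (all elements -2 except the last, which is +3, with $s = 0.43$). The function names `softmax`, `ssmax`, `figure1_input`, and `max_weights` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable Softmax over a 1-D vector."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())  # subtracting the max avoids overflow
    return e / e.sum()


def ssmax(z, s):
    """SSMax: Softmax applied to (s * log n) * z, where n is the vector size."""
    z = np.asarray(z, dtype=np.float64)
    n = z.size
    return softmax(s * np.log(n) * z)


def figure1_input(n):
    """Input from the Figure 1 caption: -2 everywhere, +3 in the last slot."""
    z = np.full(n, -2.0)
    z[-1] = 3.0
    return z


def max_weights(sizes, s=0.43):
    """Return (n, max Softmax output, max SSMax output) for each size n."""
    rows = []
    for n in sizes:
        z = figure1_input(n)
        rows.append((n, softmax(z).max(), ssmax(z, s).max()))
    return rows


if __name__ == "__main__":
    for n, sm, ss in max_weights([10, 100, 1000, 10000]):
        print(f"n={n:>6}  softmax max={sm:.4f}  ssmax max={ss:.4f}")
```

In this setting the largest Softmax output shrinks steadily as $n$ grows, while the largest SSMax output stays near 1, which matches the behavior the Figure 1 caption describes.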

Paper Structure

This paper contains 16 sections, 11 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Comparison of Softmax and SSMax, illustrating the issue of attention fading and the effectiveness of SSMax in preventing it. As the input vector size increases, the maximum value of the output vector produced by Softmax decreases, demonstrating the problem of attention fading. In contrast, SSMax keeps the maximum value close to 1, regardless of the input size. The input vector consists of -2 for all elements except the last, which is set to +3. The scaling parameter $s$ of SSMax is set to 0.43.
  • Figure 2: Relationship between $p_n$ and the input vector size $n$. The red dots represent the learned values of $p_n$ after training, and the blue curve is a fitted logarithmic function of the form $p_n \approx a_1 \log n + a_2$. This result suggests that $p_n$ depends logarithmically on $n$, motivating the reformulation of Softmax in the equation labeled eq:ssmax_with_bias.
  • Figure 3: An example illustrating the behavior of Softmax and SSMax for an input vector of size $n$ given by $(0, \frac{1}{n-2}, \frac{2}{n-2}, \dots, \frac{n-3}{n-2}, 1, z_\mathrm{max})$. The horizontal axis represents the value of $z_\mathrm{max}$, while the vertical axis represents its transformed value. The red and orange lines correspond to SSMax with different scaling parameters $s$, and the blue lines correspond to Softmax, with line styles indicating different input vector sizes. This figure demonstrates that, under Softmax, the value of $z_\mathrm{max}$ required to focus attention grows without bound as $n$ increases. In contrast, SSMax ensures that attention is focused as long as $z_\mathrm{max}$ exceeds the other values by approximately $\frac{1}{s}$, regardless of $n$.
  • Figure 4: Learning curves comparing the standard Transformer (a) and SSMax variants (b)–(d). All SSMax variants achieve consistently lower training loss than (a). Among them, the model with SSMax incorporating a bias parameter (d) exhibits the lowest loss throughout training. The results also indicate that removing the scaling parameter, as in (c), has little impact on the learning curve compared to (b).
  • Figure 5: Per-position test loss across context sizes up to 20,000. The x-axis represents context size, and the y-axis represents test loss. RoPE's $\theta$ was set to 50 times the training value, with no additional training after modification. The gray dotted line indicates the training sequence length of 1024. Results correspond to configurations (a)–(f). SSMax models (b) and (c) demonstrate improved long-context generalization compared to (a), while (d) exhibits degraded performance due to the bias parameter. Model (e), where Softmax was replaced with SSMax post-training, struggles with shorter contexts, whereas (f), which switched to SSMax during the final phase of pretraining, achieves performance somewhat close to (b), though not entirely equivalent.
  • ...and 3 more figures
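
The $\frac{1}{s}$ threshold mentioned in the Figure 3 caption can be motivated with a short bound. This is a sketch under an assumption that the captions do not state: every element other than $z_\mathrm{max}$ is at most $z_\mathrm{max} - \Delta$ for some gap $\Delta > 0$. The symbol $\Delta$ and the subscript notation $(\cdot)_{\max}$ for the output at the position of $z_\mathrm{max}$ are introduced here only for illustration.

```latex
\begin{align*}
\mathrm{SSMax}(z)_{\max}
  &= \frac{n^{s z_\mathrm{max}}}{n^{s z_\mathrm{max}} + \sum_{j \neq \max} n^{s z_j}}
   \;\ge\; \frac{1}{1 + (n-1)\, n^{-s\Delta}}, \\
\mathrm{Softmax}(z)_{\max}
  &= \frac{e^{z_\mathrm{max}}}{e^{z_\mathrm{max}} + \sum_{j \neq \max} e^{z_j}}
   \;\ge\; \frac{1}{1 + (n-1)\, e^{-\Delta}}.
\end{align*}
```

When $s\Delta > 1$, the term $(n-1)\,n^{-s\Delta}$ shrinks toward 0 as $n$ grows, so the SSMax bound approaches 1 for any fixed gap $\Delta > \frac{1}{s}$. The Softmax bound is attained when all other elements equal $z_\mathrm{max} - \Delta$, and in that case it falls toward 0 unless $\Delta$ grows roughly like $\log n$.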