Self-Adjust Softmax

Chuanyang Zheng, Yihang Gao, Guoxuan Chen, Han Shi, Jing Xiong, Xiaozhe Ren, Chao Huang, Xin Jiang, Zhenguo Li, Yu Li

TL;DR

This paper addresses gradient vanishing in softmax-based Transformer attention by introducing Self-Adjust Softmax (SA-Softmax), which multiplies the softmax output by its input to amplify gradients while preserving the ranking induced by softmax. It also offers a normalized variant that stabilizes training and preserves interpretability, and it provides theoretical gradient analyses showing improved gradient propagation. Empirically, SA-Softmax and its variants yield consistent improvements across position encodings, model sizes, sequence lengths, and downstream tasks, including large-scale pretraining and translation/classification benchmarks, with DAPEV2-Kerple often achieving the strongest gains on long contexts. The work presents SA-Softmax as a drop-in modification to attention mechanisms that enhances the scalability and generalization of Transformer models, at the cost of additional computation for min/max normalization in some variants.

Abstract

The softmax function is crucial in Transformer attention: it normalizes each row of the attention scores to sum to one and achieves superior performance over alternative functions. However, the softmax function can suffer from vanishing gradients when some elements of the attention scores approach extreme values, such as probabilities close to one or zero. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\mathrm{softmax}(x)$ to $x \cdot \mathrm{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$. We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax attention can be seamlessly integrated into the attention mechanisms of existing Transformer models with minor adjustments. We conducted experiments to evaluate the empirical performance of Transformer models using SA-Softmax compared to the vanilla softmax function. These experiments, involving models with up to 2.7 billion parameters, are conducted across diverse datasets, language tasks, and positional encoding methods.
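The two formulas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the row-wise `axis` convention, and the small `eps` added to the denominator (to avoid division by zero when a row is all zeros) are assumptions made here for illustration. Here $x_{\min}$ and $x_{\max}$ are read as the per-row minimum and maximum of the scores.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exponentiating.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


def sa_softmax(x, axis=-1):
    # SA-Softmax as written in the abstract: x * softmax(x).
    return x * softmax(x, axis=axis)


def normalized_sa_softmax(x, axis=-1, eps=1e-12):
    # Normalized variant from the abstract:
    #   (x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x)
    # eps is an assumption of this sketch, not part of the stated formula.
    lo = np.minimum(np.min(x, axis=axis, keepdims=True), 0.0)
    hi = np.maximum(np.max(x, axis=axis, keepdims=True), 0.0)
    scale = (x - lo) / (hi - lo + eps)
    return scale * softmax(x, axis=axis)


if __name__ == "__main__":
    scores = np.array([[2.0, -1.0, 0.5]])  # one row of attention scores
    print(softmax(scores))
    print(sa_softmax(scores))
    print(normalized_sa_softmax(scores))
```

In the normalized variant the scaling factor lies in $[0, 1]$, so each output is bounded above by the corresponding softmax probability. The sketch assumes finite scores; positions masked with $-\infty$ (as in causal attention) would need separate handling, which the abstract does not specify.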

Paper Structure

This paper contains 50 sections, 11 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: The loss difference and gradient difference between our methods and baseline.
  • Figure 2: The visualization of attention output, from left to right: 1) $\mathrm{softmax}(x)$; 2) $x \cdot \mathrm{softmax}(x)$; 3) $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$.
  • Figure 3: The visualization of attention output, from left to right: 1) $\mathrm{softmax}(x)$; 2) $x \cdot \mathrm{softmax}(x)$; 3) $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$.
  • Figure 4: The visualization of attention output, from left to right: 1) $\mathrm{softmax}(x)$; 2) $x \cdot \mathrm{softmax}(x)$; 3) $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$.