Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Jason Ramapuram; Federico Danieli; Eeshan Dhekane; Floris Weers; Dan Busbridge; Pierre Ablin; Tatiana Likhomanenko; Jagrit Digani; Zijin Gu; Amitis Shidani; Russ Webb

TL;DR

The paper investigates sigmoid attention as a drop-in replacement for softmax attention in transformers, addressing softmax limitations such as attention concentration and row-wise normalization. It provides a theoretical foundation by proving a Universal Approximation Property for SigmoidAttn and deriving a Lipschitz-type regularity bound that improves over softmax. It also introduces a hardware-aware implementation, FlashSigmoid, which yields a 17% inference kernel speed-up over FlashAttention2 on H100 GPUs. Empirically, SigmoidAttn matches SoftmaxAttn across language, vision, and speech tasks when properly regularized (notably with bias b ≈ -log n and Hybrid-Norm or LayerScale) and benefits from FlashSigmoid in end-to-end training and inference. The work unifies prior sigmoid attention efforts and offers practical best practices for large-scale transformers.
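
One way to read the recommended bias $b \approx -\log n$ is the following worked check. It is a sketch under a stated assumption, not the paper's own derivation: suppose all $n$ query-key logits in a row are zero (roughly the situation at initialization), so each attention weight equals $\sigma(b)$.

```latex
\sigma(b)\Big|_{b=-\log n} = \frac{1}{1+e^{\log n}} = \frac{1}{1+n},
\qquad
\sum_{j=1}^{n} \sigma(b) = \frac{n}{n+1} \approx 1 .
```

Under that assumption each row of sigmoid weights sums to roughly one, the same scale that softmax enforces by construction, whereas $b = 0$ would give a row sum of $n/2$ that grows with sequence length.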

Abstract

Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers.
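
The mechanism described in the abstract can be illustrated with a minimal single-head NumPy sketch. It is not the paper's FlashSigmoid kernel. The function names `sigmoid` and `sigmoid_attention`, the $1/\sqrt{d}$ logit scaling, and the default bias of $-\log n$ (taken from the TL;DR) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    # Logistic function written via tanh, which avoids overflow for large |x|.
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def sigmoid_attention(q, k, v, bias=None):
    """Single-head sigmoid attention (illustrative sketch).

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    bias: scalar added to every logit; defaults to -log(n) (assumed choice).
    Returns (output of shape (n, d_v), weights of shape (n, n)).
    """
    n, d = q.shape
    if bias is None:
        bias = -np.log(n)
    # Scaled dot-product logits; the 1/sqrt(d) scaling is an assumption here.
    logits = q @ k.T / np.sqrt(d) + bias
    # Element-wise sigmoid: unlike softmax, rows are not normalized to sum to 1.
    weights = sigmoid(logits)
    return weights @ v, weights
```

Calling `sigmoid_attention(q, k, v)` with `q`, `k` of shape `(n, d)` and `v` of shape `(n, d_v)` returns the attended values together with the `(n, n)` weight matrix. Because each weight is computed independently, no row-wise reduction is needed, which is the structural difference from softmax attention.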

Paper Structure (91 sections, 9 theorems, 80 equations, 55 figures, 18 tables, 2 algorithms)

Introduction
Sigmoid Attention
Theoretical Properties of Sigmoid Attention
Are Transformers with Sigmoid Attention Universal Approximators?
Sigmoid Attention layers can implement contextual mappings
Regularity of Sigmoid Attention
Computational Complexity of Sigmoid and Softmax
FlashSigmoid: Hardware-Aware Implementation
Tiling: Divide and Conquer Approach to Attention
Kernel Fusion
Activation Recomputation
Experiments
Ablations
Mitigating Large Attention Norms
Empirical Analysis of Attention Regularity
...and 76 more sections

Key Result

Theorem 3.1

We denote with $\mathcal{T}^{h,d_v,r}_{\sigma}$ the class of transformer networks obtainable by combining an arbitrary number of $\mathrm{SigmoidAttn}$ layers (each of $h$ heads of dimension $d_v$) followed by FFN layers of hidden dimension $r$. For any given continuous, permutation-equivariant func

Figures (55)

Figure 2: Train losses comparing $\mathrm{SigmoidAttn}$ with $\mathrm{SoftmaxAttn}$.
Figure 3: $\mathrm{SigmoidAttn}$ with SinCos.
Figure 4: $\mathrm{SigmoidAttn}$ with RoPE.
Figure 5: $\mathrm{SigmoidAttn}$ with ALiBi.
Figure 6: $\mathrm{SigmoidAttn}$ with RoPE, $b=-10$.
...and 50 more figures

Theorems & Definitions (18)

Theorem 3.1: UAP for $\mathrm{SigmoidAttn}$
Theorem 3.2
Definition C.1: Modified sigmoid self-attention layer
Definition C.2: Contextual mapping
Lemma C.3
proof
Lemma C.4
proof
Lemma C.5
proof
...and 8 more
