StableMask: Refining Causal Masking in Decoder-only Transformer
Qingyu Yin, Xuzheng He, Xiang Zhuang, Yu Zhao, Jianhua Yao, Xiaoyu Shen, Qiang Zhang
TL;DR
StableMask is a parameter-free refinement of the causal mask in decoder-only Transformers. It injects pseudo-attention values into the mask and applies a progressively decreasing mask ratio. Together, these two mechanisms balance attention distributions, alleviating disproportionate attention on specific tokens, and allow the model to encode absolute positional information, addressing two key limitations of softmax-based attention with RPE. The approach is supported by theoretical analysis and by empirical gains on models from 71M to 1.4B parameters, improves extrapolation with minimal disruption to existing position encodings, and integrates with hardware-accelerated attention frameworks such as FlashAttention. In practice, StableMask improves perplexity and downstream task performance, and its efficient inference variant (SM-I) remains cache-friendly and compatible with the optimizations used in current Transformer ecosystems.
Abstract
The decoder-only Transformer architecture with causal masking and relative position encoding (RPE) has become the de facto choice in language modeling. Despite its exceptional performance across various tasks, we identify two limitations. First, it requires all attention scores to be non-zero and to sum to 1, even when the current embedding already contains sufficient information on its own. This compels the model to assign disproportionately high attention to specific tokens. Second, RPE-based Transformers are not universal approximators because of their limited capacity to encode absolute positional information, which restricts their use in position-critical tasks. In this work, we propose StableMask, a parameter-free method that addresses both limitations by refining the causal mask. It introduces pseudo-attention values to balance attention distributions and encodes absolute positional information via a progressively decreasing mask ratio. StableMask's effectiveness is validated both theoretically and empirically, showing significant improvements in language models with 71M to 1.4B parameters across diverse datasets and encoding methods. We further show that it naturally supports (1) efficient extrapolation without special tricks such as StreamingLLM and (2) easy integration with existing attention optimization techniques.
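The two ideas in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the `gamma` parameter, and the linear decay of the pseudo scores are assumptions made for illustration, since the abstract does not specify the form of the pseudo-attention values. The first function shows the standard constraint (each causal softmax row sums to 1, so the first token must attend fully to itself). The second replaces the `-inf` entries for future positions with finite pseudo scores that absorb part of the probability mass and are then zeroed out, so real attention can sum to less than 1. Because row i has fewer future positions than row i-1, the share of masked slots shrinks from row to row.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def causal_attention(scores):
    """Standard causal softmax: future positions are set to -inf."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores))


def pseudo_masked_attention(scores, gamma=1.0):
    """Hypothetical sketch of a pseudo-attention mask.

    Future positions receive finite pseudo scores (assumed here to
    decay linearly with the column index, scaled by `gamma`) instead
    of -inf. After the softmax, those positions are zeroed so that no
    information flows from future tokens.
    """
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    pseudo = -gamma * np.arange(n, dtype=float)[None, :]  # assumed form
    weights = softmax(np.where(future, pseudo, scores))
    return np.where(future, 0.0, weights)


if __name__ == "__main__":
    scores = np.zeros((4, 4))  # identical raw scores, for clarity
    print("causal row sums:", causal_attention(scores).sum(axis=-1))
    print("pseudo row sums:", pseudo_masked_attention(scores).sum(axis=-1))
```

With identical raw scores, the real attention mass per row in the second function grows with the row index and reaches 1 only in the last row, which has no future positions. This row-dependent mass is one way to see how a decreasing mask ratio could carry absolute position information.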
