More Expressive Attention with Negative Weights
Ang Lv, Ruobing Xie, Shuaipeng Li, Jiayi Liao, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan
TL;DR
Cog Attention removes the traditional non-negativity constraint on attention weights by using SignExp-based normalization, enabling negative weights derived from dynamic query–key inner products $p_i = q_i k^T$. This shift allows deletion, copying, and refinement within a single head, improving expressiveness and mitigating over-squashing that leads to representational collapse. Empirically, decoder-only language models (e.g., Cogformer at 141M and larger scales) and image-generation diffusion models (U-ViT → U-ViC) show improved performance and generation fidelity, while preserving training stability through selective use of softmax in early and/or late layers. Together, these results suggest a promising direction for relaxing non-negativity constraints in attention, with potential impact on large-scale language and vision models.
Abstract
We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances parameter flexibility. For example, unlike traditional softmax attention heads that use a static output-value (OV) matrix to delete or copy inputs that the heads attend to, Cog Attention naturally learns to use the sign of dynamic query-key (QK) inner products to represent these operations. This enables Cog Attention to perform multiple operations simultaneously within a single head. Meanwhile, Cog Attention's OV matrix can focus more on refinement or modification. (2) Cog Attention enhances the model's robustness against representational collapse by preventing the "over-squashing" of earlier tokens into later positions. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models at various scales for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
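To make the contrast with softmax concrete, the following is a minimal NumPy sketch of a single query attending to three keys. It assumes one plausible form of sign-preserving normalization, $a_i = \mathrm{sign}(p_i)\,\exp(|p_i|) / \sum_j \exp(|p_j|)$; this section does not give the exact definition of SignExp, so the function `signed_weights`, the toy vectors, and the omission of causal masking are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax_weights(scores):
    """Standard softmax over the last axis: weights are non-negative and sum to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def signed_weights(scores):
    """Hypothetical signed normalization (an assumption, not the paper's code):
    the sign of each QK inner product is kept, and the magnitudes are
    normalized so that the absolute weights sum to 1."""
    mag = np.abs(scores)
    e = np.exp(mag - mag.max(axis=-1, keepdims=True))  # for numerical stability
    return np.sign(scores) * e / e.sum(axis=-1, keepdims=True)

# Toy example: one query, three keys, one-hot values so the output
# directly shows how much of each token is copied (+) or deleted (-).
q = np.array([[1.0, 0.0]])
K = np.array([[2.0, 0.0],    # aligned with q      -> positive inner product
              [-2.0, 0.0],   # opposed to q        -> negative inner product
              [0.5, 1.0]])   # weakly aligned      -> small positive inner product
V = np.eye(3)

p = q @ K.T                  # QK inner products, shape (1, 3)
a_soft = softmax_weights(p)
a_signed = signed_weights(p)

out_soft = a_soft @ V        # every token contributes with a non-negative weight
out_signed = a_signed @ V    # the second token is subtracted rather than ignored
```

Under this assumed form, softmax can only down-weight the opposed key, whereas the signed variant assigns it a negative weight of the same magnitude as the aligned key, so a single head can add one token's value and subtract another's in the same step.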
