More Expressive Attention with Negative Weights

Ang Lv, Ruobing Xie, Shuaipeng Li, Jiayi Liao, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan

TL;DR

Cog Attention removes the traditional non-negativity constraint on attention weights by using SignExp-based normalization, enabling negative weights derived from dynamic query–key inner products $p_i = q_i k^T$. This shift allows deletion, copying, and refinement within a single head, improving expressiveness and mitigating over-squashing that leads to representational collapse. Empirically, decoder-only language models (e.g., Cogformer at 141M and larger scales) and image-generation diffusion models (U-ViT → U-ViC) show improved performance and generation fidelity, while preserving training stability through selective use of softmax in early and/or late layers. Together, these results suggest a promising direction for relaxing non-negativity constraints in attention, with potential impact on large-scale language and vision models.

Abstract

We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances parameter flexibility. For example, unlike traditional softmax attention heads that use a static output-value (OV) matrix to delete or copy inputs that the heads attend to, Cog Attention naturally learns to use the sign of dynamic query-key (QK) inner products to represent these operations. This enables Cog Attention to perform multiple operations simultaneously within a single head. Meanwhile, Cog Attention's OV matrix can focus more on refinement or modification. (2) Cog Attention enhances the model's robustness against representational collapse by preventing the "over-squashing" of earlier tokens into later positions. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models at various scales for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
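The idea that the sign of a query-key inner product can encode copying or deletion within one head can be illustrated with a small sketch. It assumes the SignExp normalization has the form $a_i = \mathrm{sign}(p_i)\,e^{|p_i|} / \sum_j e^{|p_j|}$, which this page names but does not define. The function names (`cog_weights`, `attend`) and the toy scores and values are hypothetical, chosen only for illustration.

```python
import math

def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def cog_weights(scores):
    """Signed attention weights for one query row.

    Assumed form: sign(p) * exp(|p|), normalized by the sum of exp(|p|),
    so the absolute values of the weights sum to 1.
    """
    exps = [math.exp(abs(p)) for p in scores]
    total = sum(exps)
    return [sign(p) * e / total for p, e in zip(scores, exps)]

def attend(weights, values):
    """Weighted sum of value vectors (one output vector)."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy row: key 0 scores positive (its value is added),
# key 1 scores negative (its value is subtracted).
scores = [2.0, -2.0, 0.5]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = cog_weights(scores)
output = attend(weights, values)
print(weights)
print(output)
```

In this toy row the same head adds the first value vector and subtracts the second, something a softmax row cannot do because all of its weights are non-negative.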

Paper Structure

This paper contains 19 sections, 6 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: In the Indirect Object Identification (IOI) task (Wang et al., 2023), a language model should identify the indirect object ($IO$) from a context that includes both the $IO$ and a subject ($S$). Figures (a) and (b) illustrate how Cog Attention and softmax attention perform IOI through a process of elimination: a softmax attention head with a deletion-function OV matrix eliminates all attended tokens. Although the $IO$ token receives less attention than $S$, it is deleted as well. In contrast, Cog Attention shifts functions like deletion or copying from a static OV matrix to dynamic query-key inner products, allowing the head to assign negative weights to $S$ tokens for elimination while preserving the $IO$s. Figures (c) and (d) show attention weights on names versus the direction of the heads' output across the entire dataset. Cog Attention preserves the $IO$s better. For further details, see the mechanism section (sec:mechanism).
  • Figure 2: The subtraction of the maximum absolute value from a row of query-key inner products avoids numerical overflow. Meanwhile, this approach maintains the relative importance of negative and positive inner products in the final attention weights.
  • Figure 3: A naive implementation of Cog Attention in PyTorch, alongside an equivalent yet faster implementation. By writing a fused kernel in Triton (doi:10.1145/3315508.3329973), Cog Attention achieves the same efficiency as softmax attention.
  • Figure 4: Two tasks for evaluating the extent of representational collapse in language models.
  • Figure 5: The output representation differences are measured for Transformer language models utilizing Cog Attention (subfigures (a) and (b)) and softmax attention (subfigures (c) and (d)), respectively. Cog Attention enhances the robustness of language models against representational collapse.
  • ...and 6 more figures
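The overflow-avoiding step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes weights of the form $\mathrm{sign}(p_i)\,e^{|p_i|} / \sum_j e^{|p_j|}$ and assumes the shift is applied to the magnitudes $|p_j|$ of one row. The function names and test scores are hypothetical.

```python
import math

def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def signexp_naive(scores):
    """Direct evaluation; math.exp raises OverflowError for large |p|."""
    exps = [math.exp(abs(p)) for p in scores]
    total = sum(exps)
    return [sign(p) * e / total for p, e in zip(scores, exps)]

def signexp_stable(scores):
    """Subtract the row's maximum absolute value before exponentiating.

    Every exponent is then <= 0, so exp() cannot overflow, and the
    common factor exp(-shift) cancels between numerator and denominator.
    """
    shift = max(abs(p) for p in scores)
    exps = [math.exp(abs(p) - shift) for p in scores]
    total = sum(exps)
    return [sign(p) * e / total for p, e in zip(scores, exps)]

small = [2.0, -1.0, 0.5]          # both versions agree on moderate scores
large = [1000.0, -999.0, 3.0]     # naive version overflows here

try:
    signexp_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_large = signexp_stable(large)
print(naive_overflowed)
print(stable_large)
```

Because the shift cancels in the ratio, the positive and negative scores keep the same relative weight they would have had without it; only the intermediate exponentials change.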