Agent Attention: On the Integration of Softmax and Linear Attention

Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Siyuan Pan, Pengfei Wan, Shiji Song, Gao Huang

TL;DR

Agent Attention introduces a quadruple attention scheme $(Q,A,K,V)$ in which a small set of agent tokens $A$ first aggregates information from $K$ and $V$ and then broadcasts it to $Q$. This achieves $O(Nnd)$ complexity, linear in the number of tokens $N$ for a fixed number $n$ of agent tokens, while preserving global context. The method is shown to be equivalent to a generalized form of linear attention, so it combines the expressiveness of Softmax attention with the efficiency of linear attention. It is augmented with an Agent Bias and a Diversity Restoration via Depthwise Convolution to maintain feature diversity. Across ImageNet, COCO, ADE20K, and Stable Diffusion, the approach yields consistent improvements over strong baselines and enables faster diffusion generation without extra training. The work suggests that a compact agent mechanism with a global receptive field can provide efficient, scalable attention for very long token sequences in vision and beyond.
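
The aggregate-then-broadcast scheme can be written compactly as follows. This is a sketch under notation assumed here rather than defined in this summary: $Q, K, V \in \mathbb{R}^{N \times d}$, $A \in \mathbb{R}^{n \times d}$ with $n \ll N$, and $\sigma(\cdot)$ the row-wise Softmax. Scaling factors, the Agent Bias, and the depthwise convolution are omitted.

$$
O^{\mathrm{A}} \;=\; \underbrace{\sigma\!\left(QA^{\top}\right)}_{\text{broadcast},\; N \times n}\;\Big[\underbrace{\sigma\!\left(AK^{\top}\right) V}_{\text{aggregation},\; n \times d}\Big]
$$

Each of the two steps costs $O(Nnd)$, compared with $O(N^{2}d)$ for Softmax attention. Setting $\phi_q(Q) = \sigma(QA^{\top})$ and $\phi_k(K) = \sigma(AK^{\top})^{\top}$ gives $O^{\mathrm{A}} = \phi_q(Q)\,\phi_k(K)^{\top} V$, which is the generalized linear attention form referred to above.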

Abstract

The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module. The agent tokens first act as the agent for the query tokens $Q$ to aggregate information from $K$ and $V$, and then broadcast the information back to $Q$. Given that the number of agent tokens can be designed to be much smaller than the number of query tokens, agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Therefore, agent attention seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention has shown remarkable performance in high-resolution scenarios, owing to its linear attention nature. For instance, when applied to Stable Diffusion, our agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code is available at https://github.com/LeapLabTHU/Agent-Attention.
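
The two-step computation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository linked above for that). The function names, the $1/\sqrt{d}$ scaling, and the mean-pooling of contiguous query groups to obtain agent tokens are assumptions made for illustration, and the agent bias and depthwise convolution are left out. The second function evaluates the same product in the quadratic order to show that both orders agree.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise Softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(Q, A, K, V):
    # Q, K, V: (N, d); A: (n, d) with n much smaller than N.
    scale = Q.shape[-1] ** -0.5  # assumed scaling, as in standard attention
    # Step 1 (aggregation): agent tokens act as queries over K and V -> (n, d).
    agent_features = softmax(A @ K.T * scale) @ V
    # Step 2 (broadcast): each query attends to the n agent features -> (N, d).
    return softmax(Q @ A.T * scale) @ agent_features

def agent_attention_quadratic(Q, A, K, V):
    # Same result, but forms the full (N, N) weight matrix first.
    scale = Q.shape[-1] ** -0.5
    weights = softmax(Q @ A.T * scale) @ softmax(A @ K.T * scale)
    return weights @ V

rng = np.random.default_rng(0)
N, n, d = 196, 49, 32  # hypothetical sizes: 14x14 tokens, 49 agents
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
# Assumed stand-in for pooling: average each group of N // n consecutive queries.
A = Q.reshape(n, N // n, d).mean(axis=1)

out = agent_attention(Q, A, K, V)
print(out.shape)
print(np.allclose(out, agent_attention_quadratic(Q, A, K, V)))
```

The first ordering never materializes an $N \times N$ matrix, which is where the savings over Softmax attention come from when the number of agent tokens is small.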

Paper Structure (17 sections, 10 equations, 11 figures, 23 tables)

Figures (11)

  • Figure 1: An illustration of the motivation of our agent attention. (a) In Softmax attention, each query aggregates information from all features, incurring quadratic complexity. (b) Leveraging the redundancy between attention weights, agent attention uses a small number of agent tokens to act as the "agent" for queries, capturing diverse semantic information from all features, and then presenting it to each query. The attention weights are derived from DeiT-T and Agent-DeiT-T.
  • Figure 2: Difference between Softmax attention, Linear attention and Agent attention. (a) Softmax attention computes the similarities between all query-key pairs, resulting in quadratic complexity. (b) Linear attention applies mapping function $\phi(\cdot)$ to $Q$ and $K$ respectively to change the computation order, reducing complexity but suffering from insufficient expressive capability. (c) Our Agent attention employs a small group of agent tokens to aggregate and broadcast global information, leading to an elegant integration of Softmax and linear attention and naturally enjoying the advantages of both high expressiveness and low computation complexity.
  • Figure 3: An illustration of our agent attention and agent attention module. (a) Agent attention uses agent tokens to aggregate global information and distribute it to individual image tokens, resulting in a practical integration of Softmax and linear attention. $\sigma(\cdot)$ denotes the Softmax function. (b) The information flow of the agent attention module. As a showcase, we acquire agent tokens through pooling. Subsequently, agent tokens are utilized to aggregate information from $V$, and $Q$ queries features from the agent features. In addition, agent bias and DWC are adopted to add positional information and maintain feature diversity.
  • Figure 4: Comparison with SOTA models (RegNetY, DeiT, T2T-ViT, CvT, ConT, MViTv2, PVTv2, Focal Transformer, ConvNeXt) on ImageNet-1K.
  • Figure 5: Accuracy-Runtime curve on ImageNet. Runtime is tested with resolution $224^2$.
  • ...and 6 more figures