Static Key Attention in Vision

Zizhao Hu, Xiaolin Zhou, Mohammad Rostami

TL;DR

Static Key Attention (SKA) and Convolutional Static Key Attention (CSKA) replace the dynamic key in Vision Transformer attention with a learned static key, preserving the dynamic attention-value path. Across image classification, object detection, and segmentation, SKA/CSKA match or exceed MHSA performance under certain conditions, especially when used as middle layers in hierarchical MetaFormer-based backbones. The approach reduces key-generation overhead and offers a spectrum of trade-offs between convolutional efficiency and global modeling, demonstrated through extensive ablations and visual analyses of attention maps. The work highlights that dynamic key computation may be unnecessary in vision tasks, with softmax-based attention over static keys achieving competitive expressiveness and practical efficiency.

Abstract

The success of vision transformers is widely attributed to the expressive power of their dynamically parameterized multi-head self-attention mechanism. We examine the impact of substituting the dynamically parameterized key with a static key within the standard attention mechanism in Vision Transformers. Our findings reveal that static key attention mechanisms can match or even exceed the performance of standard self-attention. Integrating static key attention modules into a MetaFormer backbone, we find that they serve as a better intermediate stage in hierarchical hybrid architectures, balancing the strengths of depth-wise convolution and self-attention. Experiments on several vision tasks underscore the effectiveness of the static key mechanism, indicating that the typical two-step dynamic parameterization in attention can be streamlined to a single step without impacting performance under certain circumstances.
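The substitution described in the abstract can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation: the function names, the shape of the static key (one learned row per token position), the scaling by the square root of the dimension, and the random initialization are all assumptions made for illustration. The query and value are still computed from the input, while the key is a free parameter that does not depend on the input.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def static_key_attention(x, w_q, w_v, static_key):
    """Single-head attention with a static key (illustrative sketch).

    x:          (n, d) input tokens
    w_q, w_v:   (d, d) query and value projections (dynamic path)
    static_key: (n, d) learned parameter, assumed shape; NOT computed from x
    """
    d = len(static_key[0])
    q = matmul(x, w_q)                               # dynamic query, (n, d)
    v = matmul(x, w_v)                               # dynamic value, (n, d)
    key_t = [list(col) for col in zip(*static_key)]  # transpose to (d, n)
    scores = matmul(q, key_t)                        # (n, n)
    attn = softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])
    return matmul(attn, v), attn

def random_matrix(rows, cols, rng):
    """Gaussian-initialized matrix; stands in for trained weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, dim = 4, 8
x = random_matrix(n_tokens, dim, rng)
w_q = random_matrix(dim, dim, rng)
w_v = random_matrix(dim, dim, rng)
static_key = random_matrix(n_tokens, dim, rng)  # would be learned during training

out, attn = static_key_attention(x, w_q, w_v, static_key)
```

In standard self-attention the key would instead be computed as `matmul(x, w_k)` with an additional learned projection `w_k`; the sketch drops that step, which is the sense in which two dynamic parameterization steps reduce to one.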

Paper Structure

This paper contains 38 sections, 7 equations, 5 figures, 9 tables, 2 algorithms.

Figures (5)

  • Figure 1: The standard self-attention mechanism in Vision Transformers (left) and Static Key Attention (right), which replaces the linear layer generating the key tensor with a fixed, learnable key matrix, removing the need for extra transformations.
  • Figure 2: The proposed Static Key Attention (SKA) and Convolutional Static Key Attention (CSKA) modules, used within the MetaFormer framework. They serve as new types of token mixers, while the channel mixer is an MLP identical to that of other MetaFormer modules, such as the Vision Transformer. They can also be adopted in other architectures.
  • Figure 3: Impact of the number of heads (groups) on performance and speed after 160 epochs of training on ImageNet (full training is 310 epochs).
  • Figure 4: The FLOPs/Parameters ratio vs. spatial dimension (left) and embedding dimension (right). The embedding dimension is set to constant 256 for the left figure, and the spatial dimension is set to 256 for the right figure.
  • Figure 5: Comparison of SKA and CSKA against MHSA in terms of interpretability: average attention map over all heads for a reference image at different layers. We choose stage 3 (layers 7 to 15) of the caformer-s18 architecture as the replacement target. The rows correspond to no replacement (MHSA), replacement with static key attention (SKA), and replacement with convolutional static key attention (CSKA). SKA and CSKA behave very differently from MHSA, despite the small modification to the architecture.