Static Key Attention in Vision
Zizhao Hu, Xiaolin Zhou, Mohammad Rostami
TL;DR
Static Key Attention (SKA) and Convolutional Static Key Attention (CSKA) replace the dynamically computed key in Vision Transformer attention with a learned static key, while preserving the dynamic attention-value path. Across image classification, object detection, and segmentation, SKA and CSKA match or exceed multi-head self-attention (MHSA) under certain conditions, especially when used as the middle stages of hierarchical MetaFormer-based backbones. The approach removes the key-generation overhead and offers a spectrum of trade-offs between convolutional efficiency and global modeling, demonstrated through extensive ablations and visual analyses of attention maps. The work suggests that dynamic key computation may be unnecessary in vision tasks: softmax-based attention over static keys achieves competitive expressiveness with practical efficiency.
Abstract
The success of Vision Transformers is widely attributed to the expressive power of their dynamically parameterized multi-head self-attention mechanism. We examine the impact of substituting the dynamically parameterized key with a static key within the standard attention mechanism in Vision Transformers. Our findings reveal that static key attention mechanisms can match or even exceed the performance of standard self-attention. Integrating static key attention modules into a MetaFormer backbone, we find that they serve as a better intermediate stage in hierarchical hybrid architectures, balancing the strengths of depth-wise convolution and self-attention. Experiments on several vision tasks underscore the effectiveness of the static key mechanism, indicating that the typical two-step dynamic parameterization in attention can be streamlined to a single step without impacting performance under certain circumstances.
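The substitution described above can be illustrated with a minimal single-head sketch. It is not the authors' implementation: it assumes the static key is a learned matrix with one row per token position, and that queries and values are still projected from the input. The names `static_key_attention`, `w_q`, `w_v`, and `static_key` are hypothetical, and random values stand in for learned parameters.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Gaussian placeholder for weights that would normally be learned."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def static_key_attention(x, w_q, w_v, static_key):
    """Single-head attention whose key does not depend on the input x.

    Standard self-attention would compute k = x @ w_k; here the key is a
    fixed parameter (assumed shape: n_tokens x dim), so only the query and
    value are generated from the input.
    """
    d = len(w_q[0])
    q = matmul(x, w_q)                          # (n, d), input-dependent
    v = matmul(x, w_v)                          # (n, d), input-dependent
    scores = matmul(q, transpose(static_key))   # (n, n), key is static
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    attn = softmax_rows(scores)                 # each row sums to 1
    return matmul(attn, v), attn


rng = random.Random(0)
n_tokens, dim = 4, 8
x = random_matrix(n_tokens, dim, rng)           # token embeddings
w_q = random_matrix(dim, dim, rng)              # query projection
w_v = random_matrix(dim, dim, rng)              # value projection
static_key = random_matrix(n_tokens, dim, rng)  # learned once, shared by all inputs

out, attn = static_key_attention(x, w_q, w_v, static_key)
```

In this sketch the attention map still varies with the input through the query, so the output remains input-dependent even though one of the two dynamic projections has been removed.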
