Learning Correlation Structures for Vision Transformers
Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho
TL;DR
The paper studies how vision transformers can better capture relational structure in visual data by learning correlation patterns within key-query interactions. It introduces structural self-attention (StructSA), which applies convolutions to key-query correlations to detect multiple structural patterns, then dynamically aggregates local value contexts through learned pattern detectors and aggregators, yielding diverse attention kernels. With StructSA as the core block, the authors build StructViT and report state-of-the-art results on image and video classification benchmarks: ImageNet-1K, Kinetics-400, Something-Something V1/V2, Diving-48, and FineGym. The results suggest that incorporating structural patterns into attention helps model scene layouts, object motion, and inter-object relations, offering a structure-aware alternative to conventional self-attention. The work also provides ablations and visualizations to validate the mechanism and its benefits across dense prediction tasks and downstream applications.
Abstract
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
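The mechanism described in the abstract can be illustrated with a small sketch. The following is a simplified 1D NumPy illustration, not the authors' implementation. The function name `struct_sa_1d`, the tensor shapes, the zero padding, and the choice to normalize with a single softmax over all key positions and pattern channels are assumptions made for illustration only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def struct_sa_1d(q, k, v, pattern_kernels, aggregators):
    """Hypothetical 1D sketch of structure-aware attention.

    q, k:            (N, C)  queries and keys for N tokens
    v:               (N, Cv) values
    pattern_kernels: (D, M)  D conv kernels of odd width M (assumed shape)
    aggregators:     (D, M)  per-pattern weights over M neighboring values
    """
    n, c = q.shape
    d, m = pattern_kernels.shape
    pad = m // 2

    # Query-key correlations: row i is the correlation map of query i.
    corr = q @ k.T / np.sqrt(c)                         # (N, N)

    # Convolve each correlation map along the key axis (zero padded).
    corr_pad = np.pad(corr, ((0, 0), (pad, pad)))
    windows = sliding_window_view(corr_pad, m, axis=1)  # (N, N, M)
    patterns = windows @ pattern_kernels.T              # (N, N, D)

    # Assumption: one softmax over key positions and pattern channels.
    flat = patterns.reshape(n, -1)
    flat = flat - flat.max(axis=1, keepdims=True)
    attn = np.exp(flat)
    attn = attn / attn.sum(axis=1, keepdims=True)
    attn = attn.reshape(n, n, d)

    # Turn pattern scores into dynamic kernels over local value contexts.
    dyn = attn @ aggregators                            # (N, N, M)

    # Gather the M-neighborhood of values around every key position.
    v_pad = np.pad(v, ((pad, pad), (0, 0)))
    v_win = sliding_window_view(v_pad, m, axis=0)       # (N, Cv, M)

    # Aggregate local value contexts with the dynamic kernels.
    return np.einsum("ijm,jcm->ic", dyn, v_win)         # (N, Cv)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
out = struct_sa_1d(q, k, v, rng.normal(size=(4, 3)), rng.normal(size=(4, 3)))
print(out.shape)  # (6, 8)
```

In this sketch, setting both `pattern_kernels` and `aggregators` to a single 1x1 kernel of ones reduces the computation to ordinary softmax attention, which makes the role of the extra convolution and local aggregation easier to see.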
