Learning Correlation Structures for Vision Transformers

Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

TL;DR

The paper addresses how vision transformers can better capture relational structure in visual data by learning the correlation patterns that arise in query-key interactions. It introduces structural self-attention (StructSA), which applies convolutions to query-key correlations to detect multiple structural patterns and then uses learned pattern detectors and aggregators to dynamically aggregate local value contexts, forming diverse attention kernels. With StructSA as the main building block, the authors build StructViT and report state-of-the-art results on image and video classification benchmarks: ImageNet-1K, Kinetics-400, Something-Something V1/V2, Diving-48, and FineGym. The results indicate that incorporating structural patterns in attention improves the modeling of scene layouts, object motion, and inter-object relations, offering a structure-aware alternative to conventional self-attention. The work also provides ablations and visualizations to validate the mechanism.

Abstract

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
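
The mechanism described in the abstract can be sketched in a few lines. The following is a minimal, illustrative sketch and not the authors' implementation: it assumes a 1D token sequence, a single head, zero padding at the borders, and no score scaling. The function names (`local_window`, `struct_sa_1d`) and the D x M shapes chosen for the pattern detectors `U_K` and aggregators `U_V` are assumptions made for this example. The joint softmax over locations and patterns follows the $\sigma_{jD}$ notation used in the Figure 2 caption.

```python
import math
import random


def local_window(seq, j, M):
    """Return the M vectors centred on position j, zero-padded at the borders."""
    half = M // 2
    C = len(seq[0])
    return [seq[n] if 0 <= n < len(seq) else [0.0] * C
            for n in range(j - half, j + half + 1)]


def struct_sa_1d(q, K, V, U_K, U_V):
    """Toy single-head StructSA-style attention on a 1D sequence (assumed layout).

    q, K, V: N x C nested lists. U_K, U_V: D x M nested lists
    (hypothetical pattern detectors and aggregators).
    Returns (outputs, kernels); kernels[i][j] holds the M weights of the
    dynamic kernel for query i at location j.
    """
    N, C = len(q), len(q[0])
    D, M = len(U_K), len(U_K[0])
    outputs, kernels = [], []
    for i in range(N):
        # Local correlations q_i K_j^T (length M), projected to D pattern scores.
        scores = []
        for j in range(N):
            corr = [sum(a * b for a, b in zip(q[i], k))
                    for k in local_window(K, j, M)]
            scores.append([sum(u * c for u, c in zip(U_K[d], corr))
                           for d in range(D)])
        # Softmax taken jointly over all (location, pattern) entries.
        mx = max(max(row) for row in scores)
        exps = [[math.exp(s - mx) for s in row] for row in scores]
        Z = sum(sum(row) for row in exps)
        attn = [[e / Z for e in row] for row in exps]
        # Dynamic kernel per location, then weighted sum of local value windows.
        y = [0.0] * C
        row_kernels = []
        for j in range(N):
            kappa = [sum(attn[j][d] * U_V[d][m] for d in range(D))
                     for m in range(M)]
            row_kernels.append(kappa)
            for w, v in zip(kappa, local_window(V, j, M)):
                for c in range(C):
                    y[c] += w * v[c]
        outputs.append(y)
        kernels.append(row_kernels)
    return outputs, kernels


# Example call with random inputs (sizes are arbitrary illustration values).
random.seed(0)
N, C, D, M = 6, 4, 2, 3
rand = lambda r, c: [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
out, kappa = struct_sa_1d(rand(N, C), rand(N, C), rand(N, C), rand(D, M), rand(D, M))
```

In this sketch, setting D = 1 and using a one-hot centre tap for both `U_K` and `U_V` (for example `[[0.0, 1.0, 0.0]]` with M = 3) reduces the computation to unscaled softmax self-attention, which is a convenient sanity check for the code.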

Paper Structure (29 sections, 15 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Structural Self-Attention. Given an input video and a query indicated by the red box in (a), the query-key correlation maps in (b) clearly reveal the structures of spatial layout and motion with respect to the query. The proposed attention mechanism in (c) is designed to leverage these rich structural patterns for computing attention scores in the self-attention process.
  • Figure 2: Visualization of ConvSA and StructSA on ImageNet-1K. The query location $i$ is set to the center of the image and the kernel size is $M=3\times3$. Given the left input image, we compare ConvSA ($D=1$) and StructSA ($D=8$) in terms of (a) $D$ attention maps $\sigma_{jD} ( {\bm{q}}_i {\bm{K}}_j^\mathsf{T} {{\bm{U}}^{\mathrm{K}}}^\mathsf{T})$, (b) local feature aggregation patterns learned in ${\bm{U}}^{\mathrm{V}}$, and (c) the combinations of (a) and (b). Note that in (c), each location $j$ has an aggregation map of the kernel size $M=3\times3$ and thus we also show enlarged images for four different sample locations $j$.
  • Figure 3: Visualization of dynamic kernels $\bm\kappa^{\mathrm{struct}}_{i,j}$ in StructSA on Something-Something-V1. The top row shows the input frames that contain the input spatiotemporal local context (indicated by green boxes) used in the dynamic kernel computation. The bottom row presents the resulting dynamic kernels $\bm\kappa^\mathrm{struct}_{i,j}$ for a StructSA head when $i=j$. Note that the dynamic kernels are computed from the self-similarity map ($i=j$) to illustrate their effectiveness in capturing motion in videos. We use StructViT-S-4-1 with $M=5 \times 5 \times 5$.
  • Figure 4: Attention map visualization of SA, ConvSA, and StructSA on ImageNet-1K. The query location $i$ is set to the center of the image and the kernel size $M=3 \times 3$. Given (a) input images, we illustrate (b) attention maps of SA, (c) dynamic kernels $\bm{\kappa}^{\mathrm{conv}}_{i,j}$, (d) final attention maps of ConvSA, i.e., aggregated weights of $\bm{\kappa}^{\mathrm{conv}}_{i,j}$, (e) dynamic kernels $\bm{\kappa}^{\mathrm{struct}}_{i,j}$, and (f) final attention maps of StructSA, i.e., aggregated weights of $\bm{\kappa}^{\mathrm{struct}}_{i,j}$, respectively. Note that in (c) and (e), each location $j$ has an aggregation map of the kernel size $M=3 \times 3$ and thus we show enlarged images for three different sampled locations $j$.
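
The Figure 4 caption describes the final attention maps as aggregated weights of the dynamic kernels. One plausible reading, sketched below under stated assumptions, is a scatter-add: each kernel tap at location $j$ contributes its weight to the position it points to. This is an interpretation for illustration only, reduced to a 1D sequence with zero padding; the function name `aggregate_kernels_1d` is hypothetical and the exact procedure used for the paper's figures may differ.

```python
def aggregate_kernels_1d(kernels, M):
    """Sum per-location kernels (for one fixed query) into one weight per position.

    kernels: list of N kernels, each a list of M weights centred on its location.
    Returns a list of N aggregated attention weights.
    """
    N = len(kernels)
    half = M // 2
    attn_map = [0.0] * N
    for j, kappa in enumerate(kernels):
        for m, w in enumerate(kappa):
            n = j + m - half      # position that tap m of the kernel at j covers
            if 0 <= n < N:        # taps outside the sequence fall on zero padding
                attn_map[n] += w
    return attn_map


# Three locations, kernel size 3: border taps that leave the sequence are dropped.
example = aggregate_kernels_1d([[1.0, 2.0, 3.0],
                                [4.0, 5.0, 6.0],
                                [7.0, 8.0, 9.0]], M=3)
```

Under this reading, a kernel with only a centre tap leaves the per-location weights unchanged, while wider kernels spread each location's weight over its neighbours.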