Polyline Path Masked Attention for Vision Transformer

Zhongchen Zhao, Chaodong Xiao, Hui Lin, Qi Xie, Lei Zhang, Deyu Meng

TL;DR

This work addresses the quadratic complexity and implicit spatial encoding of Vision Transformers by introducing Polyline Path Masked Attention (PPMA), which injects a learnable 2D spatial prior into self-attention via a 2D polyline path mask derived from Mamba2. The mask decomposes into horizontal and vertical components, which enables efficient computation: the complexity is reduced to $O(N^{3/2})$ for masked attention and to $O(N^{2})$ for the mask itself. The mask can be plugged into both vanilla and criss-cross attention in ViTs. The proposed four-stage backbone with PPMA blocks achieves state-of-the-art results on ImageNet-1K, COCO, and ADE20K across Tiny, Small, and Base scales, and the ablations highlight the importance of separate horizontal/vertical decay factors and the efficacy of the 2D mask. The approach offers a practical path to explicit spatial adjacency modeling in large-scale vision models, with potential further speedups via GPU-optimized implementations.
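The complexity claim can be made concrete with a small sketch. It assumes a factorized mask of the form $M_{(i,j),(k,l)} = A^i_{j,l} \, B^l_{i,k}$, which is one reading of "decomposable into horizontal and vertical components" and is not a formula quoted from the paper. The function names (`dense_mask`, `naive_apply`, `factored_apply`) are hypothetical, and the sketch covers only a mask-times-vector product, not the full masked attention.

```python
import random

def dense_mask(A, B, H, W):
    """Materialize the (H*W) x (H*W) mask, ASSUMING the factorized form
    M[(i,j),(k,l)] = A[i][j][l] * B[l][i][k] (an illustrative assumption)."""
    N = H * W
    M = [[0.0] * N for _ in range(N)]
    for i in range(H):
        for j in range(W):
            for k in range(H):
                for l in range(W):
                    M[i * W + j][k * W + l] = A[i][j][l] * B[l][i][k]
    return M

def naive_apply(M, x):
    """Plain matrix-vector product: O(N^2) multiplications for N = H*W."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def factored_apply(A, B, x, H, W):
    """Same product without materializing M.
    Vertical pass:   z[i][l] = sum_k B[l][i][k] * x[k*W + l]   (H*H*W multiplications)
    Horizontal pass: y[i][j] = sum_l A[i][j][l] * z[i][l]      (H*W*W multiplications)
    """
    z = [[sum(B[l][i][k] * x[k * W + l] for k in range(H)) for l in range(W)]
         for i in range(H)]
    return [sum(A[i][j][l] * z[i][l] for l in range(W))
            for i in range(H) for j in range(W)]

if __name__ == "__main__":
    random.seed(0)
    H, W = 3, 4
    # One W x W matrix per row i, and one H x H matrix per column l (random stand-ins).
    A = [[[random.random() for _ in range(W)] for _ in range(W)] for _ in range(H)]
    B = [[[random.random() for _ in range(H)] for _ in range(H)] for _ in range(W)]
    x = [random.random() for _ in range(H * W)]
    ref = naive_apply(dense_mask(A, B, H, W), x)
    out = factored_apply(A, B, x, H, W)
    print(max(abs(r - o) for r, o in zip(ref, out)))
```

For a square grid with $H = W = \sqrt{N}$, the two passes cost about $2N^{3/2}$ multiplications, compared with $N^{2}$ for the dense product, which is consistent with the $O(N^{3/2})$ figure quoted above.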

Abstract

Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA), which integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first improve the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, the polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis of the structural characteristics of the proposed polyline path mask and design an efficient algorithm for its computation. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of the spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
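To illustrate why a 1D structured mask loses adjacency on images, the following minimal sketch builds the 1D decay mask in its commonly stated Mamba2 form, $L_{i,j} = a_{j+1} \cdots a_i$ for $i \ge j$ and $0$ otherwise, and applies it to a raster-scanned grid. The function name `structured_mask_1d`, the constant decay of 0.5, and the $4 \times 4$ grid are illustrative assumptions, not values from the paper.

```python
def structured_mask_1d(a):
    """Causal decay mask: L[i][j] = a[j+1] * ... * a[i] for i >= j, else 0.
    `a` holds one decay factor per token; a[0] is never used."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        prod = 1.0
        L[j][j] = 1.0                 # empty product on the diagonal
        for i in range(j + 1, n):
            prod *= a[i]              # extend the product by one step
            L[i][j] = prod
    return L

if __name__ == "__main__":
    H, W = 4, 4                       # illustrative grid size
    mask = structured_mask_1d([0.5] * (H * W))   # assumed constant decay
    # Raster scan: token (row, col) sits at index row * W + col.
    right_neighbor = mask[0 * W + 1][0 * W + 0]  # (0,1) attending to (0,0)
    lower_neighbor = mask[1 * W + 0][0 * W + 0]  # (1,0) attending to (0,0)
    print(right_neighbor, lower_neighbor)        # 0.5 0.0625
```

Both tokens are direct neighbors of $(0,0)$ on the grid, yet after raster flattening the lower neighbor is $W$ steps away in the sequence and receives the decay $0.5^{4}$ instead of $0.5$. This is the adjacency loss that a 2D scanning path is meant to avoid.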

Paper Structure

This paper contains 34 sections, 10 theorems, 40 equations, 15 figures, 6 tables, and 2 algorithms.

Key Result

Theorem 1

For any matrix $\bm{M} \in \mathbb{R}^{HW \times HW}$ and $\bm{\mathcal{M}} = \mathrm{fold}(\bm{M})$, if for all $i, j, k, l$ there exist $\bm{A}^i \in \mathbb{R}^{W \times W}$ and $\bm{B}^l \in \mathbb{R}^{H \times H}$ such that […] where $\bm{M}^A, \bm{M}^B, \hat{\bm{M}}^A, \hat{\bm{M}}^B \in \mathbb{R}^{HW \times \ldots}$ […]

Figures (15)

  • Figure 1: (a)-(b) Illustration of the modules in Mamba2 and ViT. (c) Our method adapts the structured mask of Mamba2 to 2D scanning and integrates it with ViT's self-attention.
  • Figure 2: Compared to existing scanning strategies (a) and (b), which flatten 2D tokens into a 1D sequence, our polyline path scanning (c) better preserves the adjacency of 2D tokens.
  • Figure 3: An intuitive example illustrating the polyline path mask on a $4 \times 4$ grid.
  • Figure 4: Illustration of the efficient algorithm for utilizing the proposed polyline path mask. Left: Naive computation of matrix multiplication. Right: An intuitive illustration of the efficient computation algorithm.
  • Figure 5: Overall architecture of the Polyline Path Masked Attention based Vision Transformer.
  • ...and 10 more figures

Theorems & Definitions (13)

  • Theorem 1: Matrix Decomposition
  • Corollary 1: Mask Complexity
  • Theorem 2: Efficient Matrix Multiplication
  • Corollary 2: Masked Attention Complexity
  • Theorem 1: Matrix Decomposition
  • proof
  • Theorem 2: Efficient Matrix Multiplication
  • proof
  • Theorem 3
  • proof
  • ...and 3 more