Vision Transformer with Quadrangle Attention

Qiming Zhang, Jing Zhang, Yufei Xu, Dacheng Tao

TL;DR

This work addresses the rigidity of fixed-window attention in vision transformers by introducing Quadrangle Attention (QA), a data-driven mechanism that learns per-head projective transformations to map base windows into adaptive quadrangles. QA enables flexible, long-range context modeling and cross-window information exchange with minimal computational overhead, and is integrated into both plain ViTs and hierarchical Swin-like architectures to form QFormer. Extensive experiments across image classification, object detection, semantic segmentation, and pose estimation demonstrate consistent performance gains over baseline window-attention models, often rivaling full attention at a fraction of the cost. The results establish QA as a practical and scalable improvement for vision transformers operating on high-resolution inputs and diverse object geometries.

Abstract

Window-based attention has become a popular choice in vision transformers due to its superior performance, lower computational complexity, and smaller memory footprint. However, the design of hand-crafted windows, which is data-agnostic, constrains the flexibility of transformers to adapt to objects of varying sizes, shapes, and orientations. To address this issue, we propose a novel quadrangle attention (QA) method that extends window-based attention to a general quadrangle formulation. Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles for token sampling and attention calculation, enabling the network to model various targets with different shapes and orientations and capture rich context information. We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and incurs negligible extra computational cost. Extensive experiments on public benchmarks demonstrate that QFormer outperforms existing representative vision transformers on various vision tasks, including classification, object detection, semantic segmentation, and pose estimation. The code will be made publicly available at https://github.com/ViTAE-Transformer/QFormer.

Paper Structure

This paper contains 28 sections, 18 equations, 10 figures, and 15 tables.

Figures (10)

  • Figure 1: Comparison of window configurations in previous hand-crafted designs [liu2021swin] (a) and the proposed learnable quadrangle design (b), as well as their image classification performance under different input sizes on the ImageNet [deng2009imagenet] validation set (c).
  • Figure 2: Illustration of window attention (a) and the proposed quadrangle attention (QA) (b).
  • Figure 3: The block of traditional window attention (a), the proposed QA in the plain ViT (b), and the hierarchical ViT (c).
  • Figure 4: Illustration of the projective transformation pipeline. (a) A default window. (b) The relative coordinates with respect to the window center $(x^c,y^c)$ are calculated. (c) The target quadrangle is obtained by applying the projective transformation to the relative coordinates. (d) The absolute coordinates are recovered by adding the window center coordinates.
  • Figure 5: Comparison of the same transformation to two windows at different locations using the absolute coordinates.
  • ...and 5 more figures
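
The four steps in the Figure 4 caption can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`apply_projective`, `window_corners`, `to_quadrangle`, `rotation_scale`) are hypothetical, the matrix here is a hand-picked rotation plus scaling rather than one predicted by the quadrangle regression module, and only the four window corners are transformed instead of every sampled token position.

```python
import math


def apply_projective(H, x, y):
    """Apply a 3x3 projective matrix H to the point (x, y) via homogeneous coordinates."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def window_corners(xc, yc, size):
    """(a) Corners of a square default window of side `size` centered at (xc, yc)."""
    h = size / 2.0
    return [(xc - h, yc - h), (xc + h, yc - h), (xc + h, yc + h), (xc - h, yc + h)]


def to_quadrangle(corners, center, H):
    """Map window corners to a quadrangle using center-relative coordinates."""
    xc, yc = center
    quad = []
    for x, y in corners:
        rx, ry = x - xc, y - yc                # (b) coordinates relative to the center
        qx, qy = apply_projective(H, rx, ry)   # (c) projective transformation
        quad.append((qx + xc, qy + yc))        # (d) back to absolute coordinates
    return quad


def rotation_scale(theta, s):
    """Illustrative matrix (an assumption): rotate by theta and scale by s."""
    c, sn = math.cos(theta), math.sin(theta)
    return [[s * c, -s * sn, 0.0], [s * sn, s * c, 0.0], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    H = rotation_scale(math.pi / 2, 2.0)
    for center in [(10.0, 20.0), (50.0, 5.0)]:
        corners = window_corners(center[0], center[1], 4.0)
        relative = to_quadrangle(corners, center, H)
        # Same matrix applied directly to absolute coordinates (origin as "center").
        absolute = to_quadrangle(corners, (0.0, 0.0), H)
        print(center, [(round(x, 6), round(y, 6)) for x, y in relative])
        print("  absolute:", [(round(x, 6), round(y, 6)) for x, y in absolute])
```

With center-relative coordinates, the same matrix yields the same quadrangle shape around each window's own center. Applying the matrix directly to absolute coordinates instead moves each window by an amount that depends on where it sits in the image, which appears to be the location dependence that Figure 5 compares.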