Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, Yingyan Celine Lin

TL;DR

Castling-ViT addresses the quadratic-in-token cost of Vision Transformers by training with both a linear-angular attention and a masked softmax-based quadratic attention, then casting away the costly softmax branch at inference. The angular kernel-based linear-angular attention captures global-local context via a spectral-angle similarity, while high-order residuals are approximated with a depthwise convolution and a sparsified auxiliary attention that progressively vanishes during training. This yields an efficient $O(N)$-complexity inference with comparable or superior accuracy to softmax-based ViTs, evidenced by improvements on ImageNet and COCO across multiple backbone families and tasks. The approach offers a general drop-in replacement for ViT attention that significantly reduces MACs without sacrificing performance, facilitating practical deployment on resource-constrained platforms.

Abstract

Vision Transformers (ViTs) have shown impressive performance but still incur a high computation cost compared to convolutional neural networks (CNNs). One reason is that ViTs' attention measures global similarities and thus has quadratic complexity in the number of input tokens. Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer), which sacrifice ViTs' capability of capturing either global or local context. In this work, we ask an important research question: Can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear-angular attention during ViT inference. Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. We further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and keep only the linear terms; and (2) two parameterized modules that approximate the high-order residuals: a depthwise convolution and an auxiliary masked softmax attention to help learn both global and local information, where the masks for softmax attention are regularized to gradually become zeros and thus incur no overhead during ViT inference. Extensive experiments and ablation studies on three tasks consistently validate the effectiveness of the proposed Castling-ViT, e.g., achieving up to a 1.8% higher accuracy or 40% MACs reduction on ImageNet classification and 1.2 higher mAP on COCO detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based attention.
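
To make the "keep only the linear terms" step concrete, the following is a minimal sketch, not the authors' implementation. It assumes the angular kernel takes the common form $1 - \theta/\pi$ for unit-norm queries and keys, with $\theta = \arccos(q \cdot k)$. The Taylor series $\arccos(x) = \pi/2 - x - x^3/6 - \dots$ then gives a similarity of $1/2 + x/\pi$ plus high-order residuals. The function names are hypothetical, no row normalization is applied, and the depthwise convolution and auxiliary masked softmax branch described above are omitted.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def angular_similarity(q_i, k_j):
    """Assumed angular kernel: 1 - theta/pi for unit vectors q_i, k_j."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(q_i, k_j))))
    return 1.0 - math.acos(dot) / math.pi

def linear_angular_quadratic(q, k, v):
    """Reference path: build the N x N matrix 1/2 + (q.k)/pi explicitly."""
    sim = [[0.5 + sum(a * b for a, b in zip(q_i, k_j)) / math.pi for k_j in k]
           for q_i in q]
    return matmul(sim, v)

def linear_angular_linear(q, k, v):
    """Same result, reassociated as 1/2 * sum_j v_j + (1/pi) * q (k^T v).

    k^T v is d x d, so the cost grows linearly with the token count N.
    """
    k_t = [list(col) for col in zip(*k)]   # d x N
    kv = matmul(k_t, v)                    # d x d
    v_sum = [sum(col) for col in zip(*v)]  # length d
    qkv = matmul(q, kv)                    # N x d
    return [[0.5 * s + x / math.pi for s, x in zip(v_sum, row)] for row in qkv]

# Small demo: 6 tokens, 4 channels, random unit-norm queries and keys.
random.seed(0)
n_tokens, dim = 6, 4
q = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(n_tokens)]
k = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(n_tokens)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
out_quadratic = linear_angular_quadratic(q, k, v)
out_linear = linear_angular_linear(q, k, v)
```

Both functions compute the same output. The second never materializes the $N \times N$ similarity matrix, which is the source of the $O(N)$ inference cost mentioned in the TL;DR. The gap between `angular_similarity` and $1/2 + x/\pi$ is the high-order residual that, per the abstract, the two parameterized modules approximate during training.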

Paper Structure (16 sections, 10 equations, 7 figures, 11 tables)


Figures (7)

  • Figure 1: Castling-ViT over SOTA baselines on (1) ImageNet [deng2009imagenet] image classification and (2) COCO [lin2014microsoft] object detection.
  • Figure 2: Illustration of self-attention used in LeViT and MViTv2.
  • Figure 3: (a) A simple 2D input space and (b) the corresponding 3D feature space after applying the angular kernel to the 2D input data. Note that real input data/features are of higher dimensions.
  • Figure 4: The visualization of our Castling-ViT with both linear-angular attention and auxiliary softmax-based self-attention.
  • Figure 5: Castling vs. baseline ViT on ImageNet.
  • ...and 2 more figures