On the Connection between Local Attention and Dynamic Depth-wise Convolution

Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, Jingdong Wang

TL;DR

The paper investigates the connection between local attention, the core component of Local Vision Transformers, and dynamic depth-wise convolution by reframing local attention as a channel-wise locally-connected layer with dynamic weights. It analyzes local attention through two regularization forms, sparse connectivity and weight sharing, together with dynamic weight computation, and uses these three properties to relate local attention to (dynamic) depth-wise convolution. In experiments that replace local attention with depth-wise convolution in Swin Transformer architectures, the resulting (dynamic) DWNet performs on par with or slightly better than Swin on ImageNet classification, COCO object detection, and ADE semantic segmentation at lower computation complexity. The findings suggest that Local Vision Transformers take advantage of the two regularization forms and of instance-specific (dynamic) weights to increase network capacity, offering practical guidance for efficient vision architectures.

Abstract

Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and its variant, Local Vision Transformer, makes further improvements. The major component in Local Vision Transformer, local attention, performs attention separately over small local windows. We rephrase local attention as a channel-wise locally-connected layer and analyze it in terms of two forms of network regularization, sparse connectivity and weight sharing, as well as weight computation. Sparse connectivity: there is no connection across channels, and each position is connected only to the positions within a small local window. Weight sharing: the connection weights for one position are shared across channels or within each group of channels. Dynamic weight: the connection weights are dynamically predicted according to each image instance. We point out that local attention resembles depth-wise convolution and its dynamic version in sparse connectivity. The main difference lies in weight sharing: depth-wise convolution shares connection weights (kernel weights) across spatial positions. We empirically observe that the models based on depth-wise convolution and its dynamic variant, with lower computation complexity, perform on par with or sometimes slightly better than Swin Transformer, an instance of Local Vision Transformer, for ImageNet classification, COCO object detection and ADE semantic segmentation. These observations suggest that Local Vision Transformer takes advantage of two regularization forms and dynamic weight to increase the network capacity. Code is available at https://github.com/Atten4Vis/DemystifyLocalViT.
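The contrast drawn in the abstract can be made concrete with a small sketch. The code below is not the authors' implementation. It is a minimal 1D illustration in plain Python, under several assumptions that are not in the paper: queries, keys and values are the raw input (no learned projections), there is a single attention head, windows do not overlap, and there is no position bias. The function names `depthwise_conv1d` and `local_attention1d` are hypothetical. Both layers connect each output only to nearby positions of the same channel. They differ in where the weights come from and how they are shared.

```python
import math


def depthwise_conv1d(x, kernels):
    """Static depth-wise convolution over a 1D sequence (zero padding).

    x: list of N positions, each a list of C channel values.
    kernels: list of C kernels, each a list of K weights (K odd).
    Weights are fixed parameters, differ per channel, and are shared
    across all spatial positions.
    """
    n, c = len(x), len(x[0])
    k = len(kernels[0])
    radius = k // 2
    out = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for ch in range(c):
            for j in range(k):
                p = i + j - radius
                if 0 <= p < n:  # positions outside the sequence count as zero
                    out[i][ch] += kernels[ch][j] * x[p][ch]
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention1d(x, window):
    """Single-head local attention over non-overlapping 1D windows.

    Assumption (illustration only): Q = K = V = x, no learned projections.
    Weights are computed from the input (dynamic), differ per position,
    and are shared across all channels of that position.
    """
    n, c = len(x), len(x[0])
    out = [[0.0] * c for _ in range(n)]
    for start in range(0, n, window):
        idx = list(range(start, min(start + window, n)))
        for i in idx:
            # One scalar weight per (query position, key position) pair.
            scores = [
                sum(x[i][ch] * x[j][ch] for ch in range(c)) / math.sqrt(c)
                for j in idx
            ]
            weights = softmax(scores)
            for w, j in zip(weights, idx):
                for ch in range(c):  # same weight applied to every channel
                    out[i][ch] += w * x[j][ch]
    return out


# Usage: 4 positions, 2 channels.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
kernels = [[0.0, 1.0, 0.0], [0.25, 0.5, 0.25]]  # one kernel per channel
y_conv = depthwise_conv1d(x, kernels)
y_attn = local_attention1d(x, window=2)
```

In this sketch, making the depth-wise kernels a function of the input would give a dynamic depth-wise convolution, which is the variant the paper compares against local attention.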

Paper Structure

This paper contains 20 sections, 27 equations, 3 figures, 15 tables.

Figures (3)

  • Figure 1: Illustration of connectivity for (a) convolution, (b) global attention and spatial mixing MLP, (c) local attention and depth-wise convolution, (d) point-wise MLP or $1\times 1$ convolution, and (e) MLP (fully-connected layer). In the spatial dimension, we use $1$D to illustrate the local-connectivity pattern for clarity.
  • Figure 2: Effect of the number of channels sharing the weights on ImageNet classification. X-axis: #channels within each group / #param. Y-axis: ImageNet classification accuracy. (a) Local MLP: the static version of Swin Transformer. (b) Local attention: Swin Transformer. Results are reported for the tiny model on the ImageNet dataset.
  • Figure 3: Relation graph for convolution (Conv.), depth-wise separable convolution (DW-S Conv.), Vision Transformer (ViT) building block, local ViT building block, Sep. MLP (e.g., MLP-Mixer and ResMLP), dynamic depth-wise separable convolution (Dynamic DW-S Conv.), as well as dynamic local separable MLP (e.g., involution (Li et al., 2021) and inhomogeneous dynamic depth-wise convolution) in terms of sparse connectivity and dynamic weight. Dim. = dimension including spatial and channel, Sep. = separable, LR = low rank, MS Conv. = multi-scale convolution, PVT = pyramid vision transformer.
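
The connectivity patterns listed for Figure 1 can be compared by counting how many input elements feed a single output element. The sketch below is a rough illustration rather than anything taken from the paper: the helper `connections_per_output`, the `PATTERNS` table, and the example sizes (a 56x56 feature map, 96 channels, a 7x7 window) are assumptions chosen for illustration.

```python
# Each pattern is described by its spatial reach and its channel reach.
# spatial: "local" (a window), "global" (all positions), "none" (same position)
# channel: "full" (all channels), "none" (same channel only)
PATTERNS = {
    "convolution": ("local", "full"),
    "global attention / spatial-mixing MLP": ("global", "none"),
    "local attention / depth-wise convolution": ("local", "none"),
    "point-wise MLP / 1x1 convolution": ("none", "full"),
    "MLP (fully connected)": ("global", "full"),
}


def connections_per_output(pattern, n_positions, n_channels, window):
    """Count the input elements connected to one output element."""
    spatial_reach = {"local": window, "global": n_positions, "none": 1}
    channel_reach = {"full": n_channels, "none": 1}
    spatial, channel = PATTERNS[pattern]
    return spatial_reach[spatial] * channel_reach[channel]


# Illustrative sizes (assumed): 56x56 positions, 96 channels, 7x7 window.
for name in PATTERNS:
    count = connections_per_output(name, 56 * 56, 96, 7 * 7)
    print(f"{name}: {count}")
```

Under these assumed sizes, local attention and depth-wise convolution have the sparsest connectivity of the five patterns, since they are local in space and have no connections across channels.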