TiC: Exploring Vision Transformer in Convolution

Song Zhang, Qingzhong Wang, Jiang Bian, Haoyi Xiong

TL;DR

This paper addresses the rigidity and high computational cost of ViT models when handling images with varying resolutions. It introduces Multi-Head Self-Attention Convolution (MSA-Conv), a mechanism that integrates self-attention into generalized convolutions (standard, dilated, and depthwise) to support arbitrary image sizes without retraining. Building on MSA-Conv, the Vision Transformer in Convolution (TiC) provides a hierarchical, multi-scale architecture with two enhancement strategies—Multi-Directional Cyclic Shifted Mechanism and Inter-Pooling Mechanism—to enlarge the effective receptive field and reinforce long-range token connections. Empirical results on ImageNet-1K show TiC is competitive with state-of-the-art ViT-based models, offering favorable efficiency at higher resolutions and validating the viability of combining convolutional inductive biases with transformer-style attention. The work opens promising avenues for flexible, scalable vision models that fuse the strengths of CNNs and ViTs, with public code forthcoming.

Abstract

While models derived from Vision Transformers (ViTs) have surged phenomenally, pre-trained models cannot seamlessly adapt to images of arbitrary resolution without altering the architecture and configuration, such as sampling the positional encoding, which limits their flexibility for various vision tasks. For instance, the Segment Anything Model (SAM) based on ViT-Huge requires all input images to be resized to $1024 \times 1024$. To overcome this limitation, we propose the Multi-Head Self-Attention Convolution (MSA-Conv), which incorporates self-attention within generalized convolutions, including standard, dilated, and depthwise ones. MSA-Conv enables transformers to handle images of varying sizes without retraining or rescaling, and it reduces computational cost compared with the global attention in ViT, which grows costly as image size increases. We then present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv, together with two capacity-enhancing strategies, the Multi-Directional Cyclic Shifted Mechanism and the Inter-Pooling Mechanism, which establish long-distance connections between tokens and enlarge the effective receptive field. Extensive experiments validate the overall effectiveness of TiC, and ablation studies confirm the performance improvements contributed separately by MSA-Conv and by the two capacity-enhancing strategies. Note that our aim is to study an alternative to the global attention used in ViT; MSA-Conv meets this goal by making TiC comparable to the state of the art on ImageNet-1K. Code will be released at https://github.com/zs670980918/MSA-Conv.
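
To make the idea of self-attention inside a convolution-style sliding window concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names (`local_window_attention`, `attention_pairs`), the single head, the identity Q/K/V projections, and the choice to skip out-of-bounds neighbours are all assumptions made for illustration. It shows that a windowed attention has no dependence on a fixed input size, and that its query-key pair count grows linearly with the number of tokens rather than quadratically.

```python
import math

def local_window_attention(tokens, kernel_size=3, dilation=1):
    """Single-head self-attention restricted to a sliding window.

    tokens: H x W grid (nested lists) of d-dimensional feature vectors.
    Assumption: Q, K and V are the tokens themselves (identity projections).
    """
    height, width = len(tokens), len(tokens[0])
    dim = len(tokens[0][0])
    radius = kernel_size // 2
    scale = 1.0 / math.sqrt(dim)
    output = [[None] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            query = tokens[i][j]
            # Collect the in-bounds neighbours of the (possibly dilated) window.
            neighbours = []
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    y, x = i + di * dilation, j + dj * dilation
                    if 0 <= y < height and 0 <= x < width:
                        neighbours.append(tokens[y][x])
            # Scaled dot-product scores followed by a numerically stable softmax.
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in neighbours]
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            # Attention-weighted sum of the values inside the window.
            output[i][j] = [
                sum(w * v[c] for w, v in zip(weights, neighbours)) / total
                for c in range(dim)
            ]
    return output

def attention_pairs(height, width, kernel_size=None):
    """Upper bound on query-key pairs: global attention if kernel_size is None."""
    n = height * width
    return n * n if kernel_size is None else n * kernel_size ** 2
```

The same function accepts a 4 x 5 grid or a 7 x 3 grid without any change, since nothing in it is tied to a particular resolution. Under these assumptions, a 32 x 32 token grid needs at most 9,216 query-key pairs with a 3 x 3 window, against 1,048,576 for global attention.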

Paper Structure (30 sections, 5 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Effective receptive field [b42] of the mainstream backbones and TiC.
  • Figure 2: (a) The architecture of TiC; (b) the architecture of the MSA-Conv Transformer Block; (c) an illustration of MSA-Conv.
  • Figure 3: Illustration of MSA-Conv operations: (a) self-attention with a local sliding window; (b) with a dilated sliding window; (c) with a depthwise sliding window.
  • Figure 4: Illustration of the overall Multi-Directional Cyclic Shifted Mechanism. Roll refers to [b32].
  • Figure 5: Illustration of the Impact of Different Modules on the Effective Receptive Field.
  • ...and 3 more figures
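
The Figure 4 caption mentions a roll operation as part of the Multi-Directional Cyclic Shifted Mechanism. The sketch below shows only what a cyclic shift of a 2D token grid looks like. The helper names (`roll_2d`, `multi_directional_shifts`), the four directions, and the shift step are hypothetical, and how the paper combines the shifted grids is not reproduced here.

```python
def roll_2d(grid, shift_rows, shift_cols):
    """Cyclically shift a 2D grid; elements pushed off one edge wrap to the other."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[(i - shift_rows) % height][(j - shift_cols) % width]
         for j in range(width)]
        for i in range(height)
    ]

def multi_directional_shifts(grid, step=1):
    """Return one cyclically shifted copy of the grid per direction.

    Assumption: four axis-aligned directions with a shared step size.
    """
    directions = {
        "down": (step, 0),
        "up": (-step, 0),
        "right": (0, step),
        "left": (0, -step),
    }
    return {name: roll_2d(grid, dr, dc) for name, (dr, dc) in directions.items()}
```

After a shift, tokens that were far apart in the original grid become neighbours inside the same local window, which is one plausible way a windowed attention can link distant tokens.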