ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages

Zhoujie Qian

TL;DR

ECViT addresses the inefficiency and data requirements of Vision Transformers by fusing CNN-derived locality with Transformer encoders. It tokenizes images from low-level CNN features, applies Partitioned Multi-head Self-Attention and Interactive Feed-forward Networks within a pyramidal, multi-scale framework, and uses Tokens Merging to progressively reduce the number of tokens while increasing feature dimensionality. Empirical results show that ECViT achieves strong accuracy on a small budget (e.g., 4.888M parameters and 0.698G FLOPs), outperforming several CNN- and ViT-based baselines on multiple datasets without pretraining. This hybrid approach offers a practical, resource-efficient option for real-world vision tasks and edge deployments.

Abstract

Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. However, ViTs face challenges such as high computational cost, due to the quadratic scaling of self-attention, and the need for a large amount of training data. To address these limitations, we propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of Convolutional Neural Networks (CNNs) and Transformers. ECViT introduces inductive biases inherent to CNNs, such as locality and translation invariance, into the Transformer framework by extracting patches from low-level features and enhancing the encoder with convolutional operations. Additionally, it incorporates local attention and a pyramid structure to enable efficient multi-scale feature extraction and representation. Experimental results demonstrate that ECViT achieves an optimal balance between performance and efficiency, outperforming state-of-the-art models on various image classification tasks while maintaining low computational and storage requirements. ECViT offers an ideal solution for applications that prioritize high efficiency without compromising performance.
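
The quadratic-scaling point in the abstract can be made concrete with a small counting sketch. It compares how many query-key pairs one attention head scores when attending globally versus within equal-sized partitions. The function name, the 196-token sequence, and the partition size of 49 are illustrative assumptions, not values from the paper, and the sketch ignores the class token and any cross-partition interaction ECViT may use.

```python
from typing import Optional


def attention_pair_count(num_tokens: int, partition_size: Optional[int] = None) -> int:
    """Count query-key pairs scored by one attention head.

    With partition_size=None every token attends to every token (global
    attention). Otherwise tokens are split into equal partitions and
    attention is computed only inside each partition (local attention).
    """
    if partition_size is None:
        return num_tokens * num_tokens
    if partition_size <= 0 or num_tokens % partition_size != 0:
        raise ValueError("partition_size must evenly divide num_tokens")
    num_partitions = num_tokens // partition_size
    return num_partitions * partition_size * partition_size


# Hypothetical example: a 14 x 14 token grid (196 tokens), split into 4 partitions of 49.
global_pairs = attention_pair_count(196)      # 196 * 196 = 38416
local_pairs = attention_pair_count(196, 49)   # 4 * 49 * 49 = 9604
print(global_pairs, local_pairs, global_pairs / local_pairs)
```

Under these assumptions, splitting N tokens into k equal partitions reduces the pair count from N^2 to N^2 / k, which is the kind of saving local attention is meant to provide.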

Paper Structure

This paper contains 25 sections, 17 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Performance comparison with the lightweight state-of-the-art models on CIFAR10. Our ECViT achieves the best balance between performance and efficiency.
  • Figure 2: Model overview. Our model operates in three stages, each generating feature maps at different scales. The first stage employs a convolutional network to extract low-dimensional features, which are then transformed into a sequence of tokens, including a class token. The next two stages share a similar architecture, consisting of several convolution-enhanced transformer encoder layers and a merging layer. Each encoder contains two sub-layers: partitioned multi-head self-attention (P-MSA) and Interactive Feed-forward Network (I-FFN), with residual connections applied after each module. A merging layer is then applied to reduce the sequence length and increase the feature dimensionality. Finally, the class token is utilized for prediction through an MLP Head.
  • Figure 3: Partitioned Multi-head Self-Attention
  • Figure 4: Interactive Feed-forward Network
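
The merging layer described in the Figure 2 caption (shorter sequence, wider features) can be sketched as follows. This is a minimal illustration that assumes a 2 x 2 neighbourhood merge similar to patch merging in other pyramid Transformers; the paper's actual Tokens Merging operation may differ. The function `merge_tokens`, the separate class-token projection, the grid size, and the random weights standing in for learned parameters are all assumptions made for the example.

```python
import numpy as np


def merge_tokens(tokens: np.ndarray, grid_hw: tuple, out_dim: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Merge each 2 x 2 block of patch tokens into one wider token.

    tokens : array of shape (1 + H*W, C), class token first (assumed layout).
    Returns an array of shape (1 + (H/2)*(W/2), out_dim).
    """
    h, w = grid_hw
    if h % 2 or w % 2:
        raise ValueError("grid height and width must be even")
    cls_tok, patches = tokens[:1], tokens[1:]
    c = patches.shape[1]
    if patches.shape[0] != h * w:
        raise ValueError("token count does not match grid size")

    # Group the row-major token grid into 2 x 2 blocks and concatenate channels.
    blocks = patches.reshape(h // 2, 2, w // 2, 2, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4).reshape((h // 2) * (w // 2), 4 * c)

    # Random matrices stand in for learned linear projections (assumption).
    w_patch = rng.standard_normal((4 * c, out_dim)) / np.sqrt(4 * c)
    w_cls = rng.standard_normal((c, out_dim)) / np.sqrt(c)
    return np.concatenate([cls_tok @ w_cls, blocks @ w_patch], axis=0)


rng = np.random.default_rng(0)
stage_in = rng.standard_normal((1 + 8 * 8, 32))   # hypothetical: 8 x 8 grid, 32 channels
stage_out = merge_tokens(stage_in, (8, 8), out_dim=64, rng=rng)
print(stage_in.shape, "->", stage_out.shape)
```

In this sketch each merge cuts the number of patch tokens by a factor of four while the feature dimension grows, which matches the caption's description of reducing sequence length and increasing dimensionality between stages.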