Co-Scale Conv-Attentional Image Transformers

Weijian Xu, Yifan Xu, Tyler Chang, Zhuowen Tu

TL;DR

CoaT introduces two components for efficient multi-scale vision Transformers: a co-scale mechanism that lets representations at different scales attend to one another, and a conv-attentional module that combines factorized attention with convolution-based position encodings. The architecture uses serial and parallel blocks to preserve and fuse multi-scale representations. It achieves strong ImageNet classification results and outperforms similarly sized CNNs and vision Transformers on COCO object detection and instance segmentation. Ablation studies support the importance of the positional encodings and the co-scale mechanism, and they also expose a computational cost trade-off that motivates future optimization.

Abstract

In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers. The effectiveness of CoaT's backbone is also illustrated on object detection and instance segmentation, demonstrating its applicability to downstream computer vision tasks.

Paper Structure

This paper contains 29 sections, 9 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Model Size vs. ImageNet Accuracy. Our CoaT model significantly outperforms other image Transformers. Details are in Table \ref{tab:imagenet}.
  • Figure 2: Illustration of the conv-attentional module. We apply a convolutional position encoding to the image tokens from the input. The resulting features are fed into a factorized attention with a convolutional relative position encoding.
  • Figure 3: CoaT model architecture. (Left) The overall network architecture of CoaT-Lite. CoaT-Lite consists of serial blocks only, where image features are down-sampled and processed in a sequential order. (Right) The overall network architecture of CoaT. CoaT consists of serial blocks and parallel blocks. Both blocks enable the co-scale mechanism.
  • Figure 4: Schematic illustration of the serial block in CoaT. Input feature maps are first down-sampled by a patch embedding layer, and then tokenized features (along with a class token) are processed by multiple conv-attention and feed-forward layers.
  • Figure 5: Schematic illustration of the parallel group in CoaT. In the "w/o Co-Scale" variant, tokens learned at the individual scales are combined to perform the classification, but there is no intermediate co-scale interaction between the individual paths of the parallel blocks. We propose two co-scale variants, namely direct cross-layer attention and attention with feature interpolation. Co-scale with feature interpolation is adopted in the final CoaT-Lite and CoaT models reported on the ImageNet benchmark.
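
The conv-attentional module described in the abstract and in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single head, no class token, and one common reading of the mechanism: factorized attention applies a softmax to the keys over the token axis and multiplies by the values before the queries, and the convolutional relative position encoding is the element-wise product of the queries with a depthwise convolution of the values. The function names (`factorized_attention`, `depthwise_conv2d`, `conv_attention`) and the kernel layout are hypothetical, chosen only for this example.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def factorized_attention(q, k, v):
    """Factorized attention sketch. q, k, v: arrays of shape (N, C).

    Assumption: the softmax is taken over the N tokens for each key channel,
    so the intermediate context matrix is (C, C) instead of (N, N).
    """
    _, c = q.shape
    context = softmax(k, axis=0).T @ v          # (C, C)
    return (q / np.sqrt(c)) @ context           # (N, C)


def depthwise_conv2d(x, kernels):
    """Depthwise 2D convolution with zero 'same' padding.

    x: (H, W, C) feature map; kernels: (kh, kw, C) with odd kh and kw.
    Each channel is filtered independently by its own kernel.
    """
    kh, kw, _ = kernels.shape
    ph, pw = kh // 2, kw // 2
    h, w, _ = x.shape
    padded = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += padded[i:i + h, j:j + w, :] * kernels[i, j, :]
    return out


def conv_attention(q, k, v, height, width, kernels):
    """Factorized attention plus a convolutional relative position term.

    Assumption: the position term is q * DepthwiseConv2D(v), computed on the
    values reshaped to a (height, width, C) grid. Class token handling is omitted.
    """
    n, c = q.shape
    assert n == height * width, "tokens must form a height x width grid"
    v_map = v.reshape(height, width, c)
    rel_pos = q * depthwise_conv2d(v_map, kernels).reshape(n, c)
    return factorized_attention(q, k, v) + rel_pos


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, C = 4, 5, 8
    q, k, v = (rng.normal(size=(H * W, C)) for _ in range(3))
    kernels = rng.normal(size=(3, 3, C)) * 0.1   # hypothetical 3x3 depthwise kernels
    print(conv_attention(q, k, v, H, W, kernels).shape)
```

Because the context matrix in this sketch is C × C, the cost grows linearly with the number of tokens N rather than quadratically, which is the motivation usually given for factorizing attention on high-resolution feature maps.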