Co-Scale Conv-Attentional Image Transformers
Weijian Xu, Yifan Xu, Tyler Chang, Zhuowen Tu
TL;DR
CoaT introduces a co-scale mechanism for cross-scale attention and a conv-attentional module that combines factorized attention with convolution-based relative position encodings, yielding efficient multi-scale vision Transformers. The architecture combines serial and parallel blocks to preserve and fuse representations at multiple scales. Relatively small CoaT models achieve strong ImageNet classification results and outperform similarly sized CNNs and vision Transformers on COCO object detection and instance segmentation. Ablation studies support the importance of the position encodings and the co-scale mechanism, and they also expose a computational cost trade-off that motivates future optimization. Overall, CoaT is a practical multi-scale Transformer design for high-resolution vision tasks, with demonstrated applicability to downstream tasks.
Abstract
In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers. The effectiveness of CoaT's backbone is also illustrated on object detection and instance segmentation, demonstrating its applicability to downstream computer vision tasks.
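The conv-attentional mechanism described above can be illustrated with a minimal single-head NumPy sketch. It assumes that factorized attention takes the form (Q / sqrt(C)) (softmax(K)^T V), with the softmax taken over tokens, and that the relative position term is the element-wise product of Q with a depthwise convolution of V on the token grid. Neither formula is stated in this section, so treat both as assumptions. The function names, the kernel layout, and the omission of class tokens and multiple heads are simplifications made here for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def factorized_attention(q, k, v):
    """q, k, v: (N, C) arrays for N tokens with C channels.

    Assumed form: (Q / sqrt(C)) @ (softmax(K)^T @ V). Building the (C, C)
    context first costs O(N * C^2) instead of the O(N^2 * C) of full attention.
    """
    _, c = q.shape
    context = softmax(k, axis=0).T @ v   # (C, C), softmax over tokens
    return (q / np.sqrt(c)) @ context    # (N, C)

def depthwise_conv2d(x, kernel):
    """x: (H, W, C); kernel: (kh, kw, C) with odd kh and kw.

    Zero-padded, stride-1 cross-correlation applied to each channel separately.
    """
    kh, kw, _ = kernel.shape
    ph, pw = kh // 2, kw // 2
    h, w, _ = x.shape
    padded = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += padded[i:i + h, j:j + w, :] * kernel[i, j, :]
    return out

def conv_attention(q, k, v, kernel, height, width):
    """Factorized attention plus an assumed convolutional position term.

    Tokens are assumed to be the row-major flattening of a height x width grid.
    """
    n, c = q.shape
    assert n == height * width, "tokens must form a height x width grid"
    v_grid = v.reshape(height, width, c)
    position_term = depthwise_conv2d(v_grid, kernel).reshape(n, c)
    return factorized_attention(q, k, v) + q * position_term
```

For example, with a 4 x 5 token grid and 8 channels, `conv_attention(q, k, v, kernel, 4, 5)` returns an array of shape `(20, 8)`. With an all-zero kernel the position term vanishes and the result reduces to plain factorized attention.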
