Table of Contents
Fetching ...

Revisiting the Integration of Convolution and Attention for Vision Backbone

Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau

TL;DR

By offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient.

Abstract

Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MSHAs and Convs in parallel \textbf{at different granularity levels} instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named \textit{GLMix}: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at \url{https://github.com/rayleizhu/GLMix}.

Revisiting the Integration of Convolution and Attention for Vision Backbone

TL;DR

By offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient.

Abstract

Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MSHAs and Convs in parallel \textbf{at different granularity levels} instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named \textit{GLMix}: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at \url{https://github.com/rayleizhu/GLMix}.

Paper Structure

This paper contains 16 sections, 4 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Existing integration schemes, e.g., ACMix pan2022acmix, apply MHSAs and Convs at the same granularity (top). In contrast, we affirm that by offloading the burden of extracting fine-grained features to lightweight Convs, MHSAs can be aggressively applied to coarse semantic slots to make spatial mixing more efficient (bottom).
  • Figure 2: IN1k top-1 accuracy vs. FLOPs. While several recent state-of-the-art models (i.e., MaxViT tu2022maxvit, CSWin cswin and SG-Former ren2023sgformer, SMT lin2023smt) lie in almost the same Pareto frontier, our GLNet models move the frontier further to the upper-right with a clear margin.
  • Figure 3: The semantic slots correspond to "soft" irregular semantic regions (left). Compared to using hard-divided regular patches (as adopted by plain ViTs vittouvron2021deit) on the right, our formulation is closer to tokenization in NLP.
  • Figure 4: Structure of our GLMix block. At the core is a pair of conjugated soft clustering and dispatching modules to bridge the set and grid representations and enable local-global fusion.
  • Figure 5: Visualization of semantic slots. For each sample, we show the input image (left), assignment maps of all semantic slots (middle), and four representative slots (right). We use the k-medoids algorithm to select the four representative slots automatically.
  • ...and 3 more figures