TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition

Meng Lou, Shu Zhang, Hong-Yu Zhou, Sibei Yang, Chuan Wu, Yizhou Yu

TL;DR

This paper addresses the limited representation capacity of hybrid CNN-Transformer backbones, which it attributes to static convolutions and restricted receptive fields. It introduces the Dual Dynamic Token Mixer (D-Mixer), which splits the feature map evenly into two segments: one is processed by Overlapping Spatial Reduction Attention (OSRA) for global context, the other by Input-dependent Depthwise Convolution (IDConv) for local, input-adaptive aggregation. The outputs then pass through a lightweight Squeezed Token Enhancer and a Multi-scale FFN (MS-FFN). Stacking D-Mixers into the four-stage TransXNet backbone yields state-of-the-art results on ImageNet-1K at lower computational cost (TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy with less than half the computation) and transfers well to object detection and semantic segmentation. Effective receptive field and Grad-CAM visualizations support these findings. The results suggest that learning global and local dynamics in an input-dependent manner can improve both generalization and efficiency in vision models for large-scale recognition and dense prediction.

Abstract

Recent studies have integrated convolutions into transformers to introduce inductive bias and improve generalization performance. However, the static nature of conventional convolution prevents it from dynamically adapting to input variations, resulting in a representation discrepancy between convolution and self-attention, as the latter computes attention maps dynamically. Furthermore, when stacking token mixers that consist of convolution and self-attention to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into convolution kernels. These two limitations result in a sub-optimal representation capacity of the entire network. To address them, we propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that simultaneously learns global and local dynamics by computing input-dependent global and local aggregation weights. D-Mixer works by applying an efficient global attention module and an input-dependent depthwise convolution separately on evenly split feature segments, endowing the network with strong inductive bias and an enlarged receptive field. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network that delivers compelling performance. On ImageNet-1K classification, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost. Furthermore, TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving top-1 accuracy of 83.8% and 84.6% respectively, with reasonable computational costs. Additionally, our proposed network architecture demonstrates strong generalization capabilities in various dense prediction tasks, outperforming other state-of-the-art networks while having lower computational costs. Code is publicly available at https://github.com/LMMMEng/TransXNet.
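
The split-and-mix idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (see the linked repository for that); it is a toy under several stated assumptions: tokens form a 1-D sequence rather than a 2-D feature map, the global branch is plain softmax self-attention with identity projections instead of the paper's efficient attention module, and the local branch builds each channel's kernel as a softmax-weighted mix of a small candidate bank, with mixing weights derived from the globally pooled input. The names `global_attention`, `make_dynamic_kernels`, `dynamic_depthwise_conv`, `d_mixer`, `bank`, and `gate` are all hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention(tokens):
    # Stand-in for the global branch: plain softmax self-attention with
    # identity Q/K/V projections (assumption: no spatial reduction, no learned weights).
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out

def make_dynamic_kernels(tokens, bank, gate):
    # bank[g][c]: candidate kernel g for channel c; gate[g][c]: scalar gate (both hypothetical).
    # Each channel's kernel is a convex mix of its candidates, weighted by the pooled input.
    n, dim = len(tokens), len(tokens[0])
    pooled = [sum(t[c] for t in tokens) / n for c in range(dim)]
    kernels = []
    for c in range(dim):
        mix = softmax([gate[g][c] * pooled[c] for g in range(len(bank))])
        k_len = len(bank[0][c])
        kernels.append([sum(mix[g] * bank[g][c][i] for g in range(len(bank)))
                        for i in range(k_len)])
    return kernels

def dynamic_depthwise_conv(tokens, bank, gate):
    # Depthwise: every channel is filtered independently with its own input-dependent kernel.
    kernels = make_dynamic_kernels(tokens, bank, gate)
    n, dim = len(tokens), len(tokens[0])
    out = [[0.0] * dim for _ in range(n)]
    for c in range(dim):
        kernel = kernels[c]
        half = len(kernel) // 2
        for pos in range(n):
            acc = 0.0
            for i, w in enumerate(kernel):
                src = pos + i - half
                if 0 <= src < n:  # zero padding at the sequence borders
                    acc += w * tokens[src][c]
            out[pos][c] = acc
    return out

def d_mixer(tokens, bank, gate):
    # Split channels evenly: first half goes to the global branch, second half to the local branch.
    half = len(tokens[0]) // 2
    global_part = global_attention([t[:half] for t in tokens])
    local_part = dynamic_depthwise_conv([t[half:] for t in tokens], bank, gate)
    return [g + l for g, l in zip(global_part, local_part)]  # concatenate along channels

# Toy usage: 3 tokens with 4 channels, 2 candidate kernels per local channel.
tokens = [[1.0, 0.0, 2.0, 0.5], [0.0, 1.0, 4.0, 1.5], [1.0, 1.0, 6.0, 2.5]]
identity, blur = [0.0, 1.0, 0.0], [1 / 3, 1 / 3, 1 / 3]
bank = [[identity, identity], [blur, blur]]
gate = [[1.0, 1.0], [-1.0, -1.0]]
mixed = d_mixer(tokens, bank, gate)
```

Because the kernels are recomputed from the pooled input on every call, two different inputs are filtered with different local kernels, which is the property that a static depthwise convolution lacks. The output keeps the input's shape, so such blocks can be stacked.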

Paper Structure

This paper contains 26 sections, 5 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: A comparison of Top-1 accuracy on the ImageNet-1K dataset with recent state-of-the-art methods. Our proposed TransXNet model achieves superior performance compared to existing approaches.
  • Figure 2: Visualization of effective receptive fields (ERF). The results are obtained by averaging over 100 images from ImageNet-1K.
  • Figure 3: The overall architecture of the proposed TransXNet.
  • Figure 4: Workflow of the proposed D-Mixer.
  • Figure 5: (a) Vanilla FFN only handles cross-channel information. (b) Inverted Residual FFN further aggregates tokens in a small region. (c) Our MS-FFN performs multi-scale token aggregations.
  • ...and 3 more figures