DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention
Xiaoya Tang, Bodong Zhang, Beatrice S. Knudsen, Tolga Tasdizen
TL;DR
DuoFormer addresses data efficiency and cross-scale reasoning in vision by fusing CNN-based hierarchical features with a Transformer via multi-scale patch tokenization and a Duo Attention module that combines scale attention and patch attention. The method constructs X^t_sum ∈ R^{S × N × D} from four CNN stages, employs a scale token to guide cross-scale interactions, and enables local-global processing through scale attention and patch attention. Empirical results on Utah ccRCC and TCGA ccRCC show consistent improvements over ResNet baselines and Hybrid-ViTs, with gains up to +3.83% under supervised pretraining and +9.88% under self-supervised pretraining, demonstrating robustness on small to medium medical datasets. Ablation studies confirm the complementary roles of scale attention, patch attention, and the scale token, while illustrating practical considerations for multi-scale representations across dataset sizes. The work offers a plug-and-play approach to integrate hierarchical CNN features into ViTs, enabling effective medical image classification with improved generalization.
Abstract
We propose a novel hierarchical transformer model that integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the representational potential of Vision Transformers (ViTs). To address the lack of inductive biases in ViTs and their dependence on extensive training datasets, our model employs a CNN backbone to generate hierarchical visual representations. These representations are then adapted for transformer input through a multi-scale patch tokenization. We also introduce a 'scale attention' mechanism that captures cross-scale dependencies, complementing patch attention to enhance spatial understanding and preserve global perception. Our approach significantly outperforms baseline models on small and medium-sized medical datasets, demonstrating its efficiency and generalizability. The components are designed as plug-and-play for different CNN architectures and can be adapted for multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.
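
To make the interplay of scale attention and patch attention concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes the multi-scale token tensor has shape (S, N, D), read here as S scales, N patches, and D embedding dimensions. The single-head attention, the random projection weights, the residual connections, the scale-then-patch ordering, and the omission of the scale token are all simplifying assumptions made for illustration. The function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens has shape (batch, L, D); attention runs over the L axis.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def duo_attention_sketch(x, scale_weights, patch_weights):
    """Hypothetical scale-then-patch attention over x of shape (S, N, D)."""
    # Scale attention (assumed local step): for each patch location,
    # attend across its S scale tokens. Patches act as the batch axis.
    per_patch = np.transpose(x, (1, 0, 2))            # (N, S, D)
    per_patch = per_patch + self_attention(per_patch, *scale_weights)
    x = np.transpose(per_patch, (1, 0, 2))            # back to (S, N, D)

    # Patch attention (assumed global step): within each scale,
    # attend across the N patch tokens. Scales act as the batch axis.
    x = x + self_attention(x, *patch_weights)
    return x


rng = np.random.default_rng(0)
S, N, D = 4, 49, 32  # illustrative sizes only; S = 4 mirrors the four CNN stages
x = rng.standard_normal((S, N, D))
# Random stand-ins for learned query/key/value projections.
scale_weights = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)]
patch_weights = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)]

out = duo_attention_sketch(x, scale_weights, patch_weights)
print(out.shape)  # (4, 49, 32)
```

In this sketch the two steps differ only in which axis is treated as the sequence: transposing to (N, S, D) makes each patch attend over its own scales, while the original (S, N, D) layout makes each scale attend over all patches.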
