DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention

Xiaoya Tang; Bodong Zhang; Beatrice S. Knudsen; Tolga Tasdizen

TL;DR

DuoFormer addresses data efficiency and cross-scale reasoning in vision by fusing hierarchical CNN features with a Transformer through multi-scale patch tokenization and a Duo Attention module that combines scale attention and patch attention. The method constructs X^t_sum ∈ R^{S x N x D} from four CNN stages, uses a scale token to guide cross-scale interactions, and enables local-global processing through scale and patch attention. Empirical results on Utah ccRCC and TCGA ccRCC show consistent improvements over ResNet baselines and Hybrid-ViTs, with gains of up to +3.83% under supervised pretraining and +9.88% under self-supervised pretraining, demonstrating robustness on small to medium-sized medical datasets. Ablation studies confirm the complementary roles of scale attention, patch attention, and the scale token, and illustrate practical considerations for multi-scale representations across dataset sizes. The work offers a plug-and-play approach for integrating hierarchical CNN features into ViTs, enabling effective medical image classification with improved generalization.

Abstract

We propose a novel hierarchical transformer model that integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the representational potential of Vision Transformers (ViTs). To address the lack of inductive biases in ViTs and their dependence on extensive training datasets, our model employs a CNN backbone to generate hierarchical visual representations. These representations are then adapted for transformer input through an innovative patch tokenization. We also introduce a 'scale attention' mechanism that captures cross-scale dependencies, complementing patch attention to enhance spatial understanding and preserve global perception. Our approach significantly outperforms baseline models on small and medium-sized medical datasets, demonstrating its efficiency and generalizability. The components are designed as plug-and-play for different CNN architectures and can be adapted for multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.
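
To make the two attention axes concrete, the following is a minimal shape-level sketch, not the authors' implementation. It takes a multi-scale tensor of shape S x N x D (as in the TL;DR) and applies attention first along the scale axis for each patch position, then along the patch axis for each scale. Several things here are assumptions made for illustration: the helper names (`self_attention`, `scale_attention`, `patch_attention`), the use of single-head attention without learned projections, the omission of the scale token, the scale-then-patch ordering, and the embedding size D = 16. S = 4 follows the four CNN stages and N = 49 follows the grid size given in the Figure 3 caption.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention WITHOUT learned Q/K/V projections
    # (a simplification for illustration only).
    # tokens: (batch, length, dim); attention runs over `length`.
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def scale_attention(x):
    # x: (S, N, D). Each patch position attends across its S scales only.
    per_patch = x.transpose(1, 0, 2)       # (N, S, D)
    out = self_attention(per_patch)        # (N, S, D)
    return out.transpose(1, 0, 2)          # back to (S, N, D)

def patch_attention(x):
    # x: (S, N, D). Within each scale, every patch attends to all N patches.
    return self_attention(x)

# Assumed sizes: S = 4 stages, N = 49 patches (7 x 7 grid), D = 16 (hypothetical).
S, N, D = 4, 49, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((S, N, D))

y = patch_attention(scale_attention(x))
print(y.shape)
```

In this sketch, perturbing one patch position leaves the scale-attention output at every other position unchanged, while the patch-attention output changes everywhere. That is the sense in which the first step is local and the second is global.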

Paper Structure (17 sections, 3 equations, 5 figures, 3 tables)

Introduction
Related Work
Methodology
Multi-scale Patch Tokenization
Duo Attention Module
Scale Token
Experiments
Experimental Setup
Result Analysis
Ablation Studies
Ablation on Scale Attention
Ablation on Scale Token
Ablation on Multi-Scale Representations
Conclusion
Appendix
...and 2 more sections

Figures (5)

Figure 1: The pipeline of the proposed DuoFormer. Dimensionalities of the multi-scale representation: S: scale dimension; P: number of patches; D: embedding dimension.
Figure 2: Visualization of Multiscale Patch Tokenization: This figure depicts the process of converting an image into a sequence of multi-scale patch embeddings, with each color representing a different scale to illustrate the varied dimensions of the patches.
Figure 3: Illustration of the Duo attentions. Panel (a) shows the local (yellow arrows) and global (blue arrows) dependencies among multi-scale patches, maintaining a consistent grid size of 49; larger patches indicate greater embedding lengths. Panel (b) details the model architecture, including L layers of scale and patch attention blocks in the encoder.
Figure 4: Ablation study on combinations of hierarchical stages. Stages are represented by colors from light to dark. Bar heights and black error bars show mean accuracies and standard deviations, and the blue dashed line marks the ResNet baseline.
Figure 5: Ablation studies comparing the number of layers and heads in the dual attention modules for both the TCGA (solid bars) and Utah (striped bars) datasets. The dashed lines represent ResNet baselines for each dataset. Each configuration synchronizes the layers between scale and patch attention.
