TransResNet: Integrating the Strengths of ViTs and CNNs for High Resolution Medical Image Segmentation via Feature Grafting

Muhammad Hamza Sharif; Dmitry Demidov; Asif Hanif; Mohammad Yaqub; Min Xu

TL;DR

TransResNet tackles high-resolution medical image segmentation by combining parallel CNN and Transformer encoders with a Cross Grafting Module that fuses local details and global semantics. The CGM aligns and integrates feature maps from ResNet-18 and Swin-B to produce grafted features used in decoding, with a joint loss incorporating segmentation, attention, and auxiliary terms. Evaluations across ten high-resolution datasets (skin lesions, retinal vessels, and polyps) show state-of-the-art or competitive performance, highlighting improved information flow for precise masks while acknowledging higher computational cost. The approach is open-sourced, offering a strong foundation for high-resolution medical image analysis and future extensions to multi-class tasks and efficiency optimizations.

Abstract

High-resolution images are preferable in the medical imaging domain because they significantly improve the diagnostic capability of the underlying method. In particular, high resolution substantially improves automatic image segmentation. However, most existing deep learning-based techniques for medical image segmentation are optimized for input images with small spatial dimensions and perform poorly on high-resolution images. To address this shortcoming, we propose a parallel-in-branch architecture called TransResNet, which runs a Transformer and a CNN in parallel to extract features from multi-resolution images independently. In TransResNet, we introduce the Cross Grafting Module (CGM), which generates grafted features, enriched in both global semantic and low-level spatial details, by combining the feature maps from the Transformer and CNN branches through fusion and a self-attention mechanism. Moreover, we use these grafted features in the decoding process, increasing the information flow for better prediction of the segmentation mask. Extensive experiments on ten datasets demonstrate that TransResNet achieves either state-of-the-art or competitive results on several segmentation tasks, including skin lesion, retinal vessel, and polyp segmentation. The source code and pre-trained models are available at https://github.com/Sharifmhamza/TransResNet.
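
The grafting idea described in the abstract (align the two branches, fuse them, then apply self-attention) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`graft`, `upsample_nearest`), the additive fusion, the nearest-neighbour alignment, the residual connection, and the assumption that the CNN map has the larger spatial size are all hypothetical choices made only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an [H][W][C] nested-list feature map."""
    out = []
    for row in fmap:
        wide = [list(px) for px in row for _ in range(factor)]
        out.extend([list(px) for px in wide] for _ in range(factor))
    return out

def graft(cnn_map, vit_map):
    """Hypothetical grafting step (an assumption, not the paper's exact CGM).

    1. Align: upsample the coarser Transformer map to the CNN map's size.
    2. Fuse: add the two maps element-wise.
    3. Attend: run single-head self-attention over the fused tokens and
       add the result back to the fused tokens (residual connection).
    Returns the grafted tokens and the attention matrix.
    """
    factor = len(cnn_map) // len(vit_map)
    vit_up = upsample_nearest(vit_map, factor)

    # Flatten [H][W][C] into a list of H*W tokens with C channels each.
    cnn_tokens = [px for row in cnn_map for px in row]
    vit_tokens = [px for row in vit_up for px in row]
    fused = [[a + b for a, b in zip(c, v)] for c, v in zip(cnn_tokens, vit_tokens)]

    # Scaled dot-product attention with queries = keys = values = fused tokens.
    scale = 1.0 / math.sqrt(len(fused[0]))
    attn = [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in fused])
        for q in fused
    ]

    grafted = []
    for token, weights in zip(fused, attn):
        mixed = [sum(w * v[c] for w, v in zip(weights, fused)) for c in range(len(token))]
        grafted.append([t + m for t, m in zip(token, mixed)])
    return grafted, attn

# Toy inputs: a 4x4 CNN map and a 2x2 Transformer map, 2 channels each.
cnn_map = [[[0.1 * (r + c), 0.05 * r] for c in range(4)] for r in range(4)]
vit_map = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
grafted, attn = graft(cnn_map, vit_map)
print(len(grafted), len(grafted[0]))  # 16 2
```

In this toy version the attention matrix is returned alongside the grafted tokens only so it can be inspected; how the paper's module builds and uses its own attention matrix in the objective is described in the Cross Grafting Module and Objective Function sections.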

Paper Structure (18 sections, 9 equations, 7 figures, 7 tables)

Introduction
Related Work
Methodology
Encoder Module
Cross Grafting Module (CGM)
Decoder Module
Objective Function
Experiments
Datasets
Implementation Details
Evaluation Metric
Quantitative Results
Qualitative Results
Ablation Study
Conclusion
...and 3 more sections

Figures (7)

Figure 1: An overview of the TransResNet architecture for high-resolution medical image segmentation. TransResNet uses parallel branches built on Swin Transformer and ResNet-18 backbones as encoders. The core module of the architecture is the Cross Grafting Module (CGM), explained briefly in Fig. 2. The decoder module aggregates the feature maps flowing in from the Swin block, the CGM block, and the ResNet block. D1, D2, and D3 are sub-blocks of the decoder; their structure is shown on the right side.
Figure 2: An overview of the architecture of the proposed Cross Grafting Module (CGM). The CGM takes two inputs, the feature maps from the Swin Transformer and ResNet branches, and outputs the grafted features through fusion and a self-attention mechanism. These grafted features are used in the decoding process. The module also generates a cross-transposed attention matrix (CTAM), which is used in the objective function.
Figure 3: Qualitative results on all three segmentation tasks. The figure shows an example image, ground truth (GT) and predicted (PRED) segmentation mask for the skin lesion segmentation task (row 1), the polyp segmentation task (row 2) and retinal vessel segmentation task (row 3).
Figure 4: Visualizations of the predicted mask with PCS. The two images on the left show the predicted mask without PCS, while the right side shows the sharper mask after applying PCS. The IoU scores indicate that performance increases after applying PCS.
Figure 5: Visualizations of skin lesion segmentation
...and 2 more figures
