HyFormer-Net: A Synergistic CNN-Transformer with Interpretable Multi-Scale Fusion for Breast Lesion Segmentation and Classification in Ultrasound Images

Mohammad Amanour Rahman

TL;DR

HyFormer-Net tackles the challenge of accurate breast lesion segmentation and malignant classification in ultrasound by uniting EfficientNet-B3 and Swin Transformer in a multi-scale fusion framework with an attention-driven decoder. The model demonstrates strong in-distribution performance on BUSI, superior malignant recall, and substantial gains from ensembling, while providing intrinsic interpretability through decoder attention maps and Grad-CAM analyses. A rigorous cross-dataset study reveals that zero-shot transfer suffers from domain shift, but progressive fine-tuning with a modest amount of target-domain data achieves substantial recovery and even surpasses source-domain performance with enough data, establishing practical deployment guidelines. Overall, the work advances clinically relevant AI in ultrasound by delivering a transparent, multitask model with data-efficient adaptation and quantitative interpretability metrics, paving the way for more trustworthy CAD systems in breast cancer screening.

Abstract

B-mode ultrasound for breast cancer diagnosis faces three challenges: speckle noise, operator dependency, and indistinct lesion boundaries. Existing deep learning approaches suffer from single-task learning, architectural constraints (CNNs lack global context; Transformers lack local feature detail), and black-box decision-making. These gaps hinder clinical adoption. We propose HyFormer-Net, a hybrid CNN-Transformer for simultaneous segmentation and classification with intrinsic interpretability. Its dual-branch encoder integrates EfficientNet-B3 and Swin Transformer via multi-scale hierarchical fusion blocks. An attention-gated decoder provides precision and explainability. We introduce dual-pipeline interpretability: (1) intrinsic attention validation with quantitative IoU verification (mean: 0.86), and (2) Grad-CAM for classification reasoning. On the BUSI dataset, HyFormer-Net achieves a Dice score of 0.761 ± 0.072 and an accuracy of 93.2%, outperforming U-Net, Attention U-Net, and TransUNet. A Malignant Recall of 92.1 ± 2.2% keeps false negatives low. Ensemble modeling yields a Dice score of 90.2%, an accuracy of 99.5%, and a Malignant Recall of 100%, i.e., no malignant case is missed. Ablation studies confirm that multi-scale fusion contributes +16.8% Dice and attention gates add +5.9%. Crucially, we conduct the first cross-dataset generalization study for hybrid CNN-Transformers in breast ultrasound. Zero-shot transfer fails (Dice: 0.058), confirming domain shift. However, progressive fine-tuning with only 10% of the target-domain data (68 images) recovers 92.5% performance. With 50% of the data, our model achieves 77.3% Dice, exceeding source-domain performance (76.1%) and demonstrating true generalization.
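For readers less familiar with the two headline metrics, the following is a minimal sketch of how a Dice score and a malignant-class recall are commonly computed. The function names, the smoothing constant `eps`, and the label convention (1 = malignant) are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def dice_score(pred_mask, true_mask, eps=1e-7):
    """Dice = 2|P ∩ G| / (|P| + |G|) for two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    # eps (an assumed smoothing term) avoids 0/0 when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)

def malignant_recall(pred_labels, true_labels, malignant=1):
    """Recall on the malignant class: TP / (TP + FN)."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    is_malignant = true == malignant
    if is_malignant.sum() == 0:
        return float("nan")  # recall is undefined without malignant cases
    true_positives = np.logical_and(is_malignant, pred == malignant).sum()
    return true_positives / is_malignant.sum()
```

A recall of 100% on this definition means that every case labelled malignant in the ground truth was also predicted malignant; it says nothing about false positives.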

Paper Structure

This paper contains 39 sections, 14 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Representative samples from BUSI dataset.
  • Figure 2: The proposed HyFormer-Net Architecture. The framework consists of four key stages: (A) Dual-Branch Encoder with parallel CNN (EfficientNet-B3) and Swin Transformer streams to capture local and global features, respectively. (B) Multi-Scale Fusion Blocks that synergistically integrate features at four hierarchical levels. (C) Attention-Gated Decoder which uses skip connections filtered by attention maps ($\alpha$) to refine segmentation boundaries. (D) Multi-Task Heads for simultaneous lesion segmentation and classification.
  • Figure 3: Diagram of the Intrinsic Attention Validation Pipeline. The raw attention map ($\alpha$) is extracted from the decoder's Attention Gate. It is then upsampled, binarized via Otsu's thresholding, and cleaned with morphological opening (Serra, 1983) to create a final attention mask. This mask is then compared against the ground truth using IoU to quantitatively measure the model's spatial focus.
  • Figure 4: Schematic of the Post-Hoc Grad-CAM Pipeline. For a given input image, gradients from the target class output are backpropagated to the final convolutional feature maps of the encoder. These gradients are globally pooled to obtain channel importance weights ($\alpha_k$), which are then used to compute a weighted sum of the feature maps, producing the final heatmap that localizes class-discriminative regions.
  • Figure 5: Domain adaptation learning curve. Blue line: HyFormer-Net Dice Score on external dataset with progressive fine-tuning. Red dashed line: BUSI in-distribution baseline (single model: 76.1% Dice, ensemble: 90.2% Dice). Green shaded region: 95% confidence intervals. Orange annotation highlights performance ceiling breakthrough at 50% fine-tuning, exceeding source domain by +1.2%. The steep initial gradient (0% → 10%) demonstrates exceptional sample efficiency for practical deployment.
  • ...and 2 more figures
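The intrinsic attention validation steps named in the Figure 3 caption (upsample, Otsu threshold, morphological opening, IoU against the ground truth) can be sketched as follows. This is an illustrative NumPy-only version under several assumptions: nearest-neighbour upsampling, a 256-bin histogram for Otsu's method, and a 3x3 structuring element. The caption specifies none of these choices, and all function names are hypothetical.

```python
import numpy as np

def upsample_nearest(att, out_shape):
    """Nearest-neighbour upsampling of a 2-D map to out_shape = (H, W)."""
    h, w = att.shape
    rows = np.arange(out_shape[0]) * h // out_shape[0]
    cols = np.arange(out_shape[1]) * w // out_shape[1]
    return att[np.ix_(rows, cols)]

def otsu_threshold(values, bins=256):
    """Threshold that maximises the between-class variance (Otsu's method)."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist)                 # pixel count at or below each bin
    w1 = w0[-1] - w0                     # pixel count above each bin
    s0 = np.cumsum(hist * centers)
    mu0 = s0 / np.maximum(w0, 1)         # mean of the lower class
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)  # mean of the upper class
    between = w0 * w1 * (mu0 - mu1) ** 2
    return edges[np.argmax(between) + 1]  # right edge of the best split bin

def _neighbourhood_stack(mask):
    """Stack the nine 3x3-shifted copies of a mask (padded with False)."""
    padded = np.pad(mask, 1, constant_values=False)
    h, w = mask.shape
    return np.stack([padded[i:i + h, j:j + w]
                     for i in range(3) for j in range(3)])

def binary_opening_3x3(mask):
    """Opening = erosion followed by dilation; removes small specks."""
    eroded = _neighbourhood_stack(mask).all(axis=0)
    return _neighbourhood_stack(eroded).any(axis=0)

def iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def attention_iou(att, gt_mask):
    """Upsample -> Otsu binarise -> open -> IoU against the ground truth."""
    gt = np.asarray(gt_mask, dtype=bool)
    up = upsample_nearest(np.asarray(att, dtype=float), gt.shape)
    attention_mask = binary_opening_3x3(up >= otsu_threshold(up))
    return iou(attention_mask, gt)
```

In this sketch, the opening step is what discards small, isolated high-attention specks before the overlap is measured, so the resulting IoU reflects the main region of focus and not stray activations.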
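The Grad-CAM computation described in the Figure 4 caption (globally pooled gradients as channel weights $\alpha_k$, then a weighted sum of feature maps) can be illustrated with a small NumPy sketch. It assumes the feature maps and their gradients have already been extracted from the network as arrays of shape (K, H, W). The ReLU follows the standard Grad-CAM formulation, and the final rescaling to [0, 1] is an assumption made for display; the caption states neither step.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from feature maps A^k and gradients dy_c/dA^k.

    Both inputs are arrays of shape (K, H, W), assumed to be precomputed.
    """
    A = np.asarray(feature_maps, dtype=float)
    G = np.asarray(gradients, dtype=float)
    alpha = G.mean(axis=(1, 2))            # global average pooling, shape (K,)
    cam = np.tensordot(alpha, A, axes=1)   # sum_k alpha_k * A^k, shape (H, W)
    cam = np.maximum(cam, 0.0)             # ReLU (standard Grad-CAM)
    peak = cam.max()
    return cam / peak if peak > 0 else cam  # assumed rescaling for display
```

Channels with negative pooled gradients lower the score of the target class, so their contributions are suppressed by the ReLU and only regions that support the predicted class remain in the heatmap.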