HyFormer-Net: A Synergistic CNN-Transformer with Interpretable Multi-Scale Fusion for Breast Lesion Segmentation and Classification in Ultrasound Images
Mohammad Amanour Rahman
TL;DR
HyFormer-Net addresses breast lesion segmentation and malignancy classification in ultrasound by combining EfficientNet-B3 and Swin Transformer in a multi-scale fusion framework with an attention-gated decoder. The model shows strong in-distribution performance on BUSI, high malignant recall, and substantial gains from ensembling, while providing intrinsic interpretability through decoder attention maps and Grad-CAM analyses. A cross-dataset study shows that zero-shot transfer fails under domain shift, but progressive fine-tuning on 10% of the target-domain data recovers most of the lost performance, and fine-tuning on 50% surpasses source-domain performance, which yields practical deployment guidelines. Overall, the work contributes a transparent multitask model with data-efficient adaptation and quantitative interpretability metrics, supporting more trustworthy CAD systems for breast cancer screening.
Abstract
B-mode ultrasound for breast cancer diagnosis is hampered by speckle noise, operator dependency, and indistinct lesion boundaries. Existing deep learning approaches are limited by single-task learning, architectural constraints (CNNs lack global context, while Transformers capture local features poorly), and black-box decision-making, all of which hinder clinical adoption. We propose HyFormer-Net, a hybrid CNN-Transformer for simultaneous segmentation and classification with intrinsic interpretability. Its dual-branch encoder integrates EfficientNet-B3 and Swin Transformer via multi-scale hierarchical fusion blocks, and an attention-gated decoder provides precision and explainability. We introduce dual-pipeline interpretability: (1) intrinsic attention validation with quantitative IoU verification (mean: 0.86), and (2) Grad-CAM for classification reasoning. On the BUSI dataset, HyFormer-Net achieves a Dice score of 0.761 ± 0.072 and an accuracy of 93.2%, outperforming U-Net, Attention U-Net, and TransUNet. A malignant recall of 92.1 ± 2.2% keeps false negatives low. Ensemble modeling yields a Dice score of 0.902, an accuracy of 99.5%, and a malignant recall of 100%, eliminating false negatives. Ablation studies confirm that multi-scale fusion contributes +16.8% Dice and attention gates add +5.9%. Crucially, we conduct the first cross-dataset generalization study for hybrid CNN-Transformers in breast ultrasound. Zero-shot transfer fails (Dice: 0.058), confirming domain shift. However, progressive fine-tuning with only 10% of the target-domain data (68 images) recovers 92.5% of performance. With 50% of the data, the model achieves a Dice score of 0.773, exceeding source-domain performance (0.761) and demonstrating true generalization.
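The metrics quoted above (Dice, IoU, malignant recall) can be illustrated with a minimal pure-Python sketch. It assumes flattened binary masks and binary class labels with 1 denoting malignant. The function names and toy inputs are hypothetical and are not taken from the HyFormer-Net code. The helper `relative_performance` is likewise an assumption about how a target-to-source Dice ratio might be formed, not the paper's definition of "recovery".

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (sequences of 0/1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def iou_score(pred, target):
    """IoU = |P ∩ T| / |P ∪ T| for flat binary masks (sequences of 0/1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - inter
    return 1.0 if union == 0 else inter / union


def malignant_recall(y_true, y_pred, positive=1):
    """Recall for the malignant class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return 0.0 if tp + fn == 0 else tp / (tp + fn)


def relative_performance(target_dice, source_dice):
    """Ratio of target-domain Dice to source-domain Dice (assumed definition)."""
    return target_dice / source_dice


# Toy segmentation example: 2 overlapping pixels, 3 predicted, 3 in ground truth.
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 0, 1, 1, 1, 0]
toy_dice = dice_score(pred_mask, true_mask)
toy_iou = iou_score(pred_mask, true_mask)

# Toy classification example: 3 malignant cases, 2 detected.
labels = [1, 1, 0, 1, 0]
predictions = [1, 0, 0, 1, 1]
toy_recall = malignant_recall(labels, predictions)

# Dice values reported in the abstract: 50% fine-tuning vs. source domain.
ratio_50 = relative_performance(0.773, 0.761)
```

A ratio above 1 from `relative_performance(0.773, 0.761)` corresponds to the abstract's statement that fine-tuning on 50% of the target data exceeds source-domain Dice. The same `iou_score` function could be applied to a thresholded attention map and a ground-truth mask, although the threshold used in the paper is not stated here.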
