HBFormer: A Hybrid-Bridge Transformer for Microtumor and Miniature Organ Segmentation
Fuchen Zheng, Xinyi Chen, Weixuan Li, Quanjun Li, Junhua Zhou, Xiaojiao Guo, Xuhang Chen, Chi-Man Pun, Shoujun Zhou
TL;DR
HBFormer targets the segmentation of microtumors and miniature organs by bridging local detail and global context in a hybrid CNN–Transformer architecture. It pairs a Swin Transformer encoder with a novel Multi-Scale Feature Fusion (MFF) decoder, which includes the Med-DSPP block, to explicitly fuse multi-scale encoder features with long-range context. On LiTS2017 (liver tumor), ISICDM2019 (bladder tumor), and Synapse (multi-organ), HBFormer reports state-of-the-art performance, with notable gains in boundary delineation of small structures. The results indicate that the MFF bridge and the enhanced encoder handle diverse anatomies and pathologies effectively, which suggests potential for clinical deployment and for future 3D extensions.
Abstract
Medical image segmentation is a cornerstone of modern clinical diagnostics. While Vision Transformers that leverage shifted window-based self-attention have established new benchmarks in this field, they are often hampered by a critical limitation: their localized attention mechanism struggles to effectively fuse local details with global context. This deficiency is particularly detrimental to challenging tasks such as the segmentation of microtumors and miniature organs, where both fine-grained boundary definition and broad contextual understanding are paramount. To address this gap, we propose HBFormer, a novel Hybrid-Bridge Transformer architecture. The 'Hybrid' design of HBFormer synergizes a classic U-shaped encoder-decoder framework with a powerful Swin Transformer backbone for robust hierarchical feature extraction. The core innovation lies in its 'Bridge' mechanism, a sophisticated nexus for multi-scale feature integration. This bridge is architecturally embodied by our novel Multi-Scale Feature Fusion (MFF) decoder. Departing from conventional symmetric designs, the MFF decoder is engineered to fuse multi-scale features from the encoder with global contextual information. It achieves this through a synergistic combination of channel and spatial attention modules, which are constructed from a series of dilated and depth-wise convolutions. These components work in concert to create a powerful feature bridge that explicitly captures long-range dependencies and refines object boundaries with exceptional precision. Comprehensive experiments on challenging medical image segmentation datasets, including multi-organ, liver tumor, and bladder tumor benchmarks, demonstrate that HBFormer achieves state-of-the-art results, showcasing its outstanding capabilities in microtumor and miniature organ segmentation. Code and models are available at: https://github.com/lzeeorno/HBFormer.
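The abstract attributes the MFF decoder's attention modules to dilated and depth-wise convolutions. As a rough illustration of why that pairing suits long-range context at low cost, the sketch below computes the receptive field of stacked dilated kernels and compares weight counts for standard and depth-wise convolutions. The kernel size (3), dilation rates (1, 2, 3), and channel width (64) are illustrative assumptions, not values taken from HBFormer, and the helper names are hypothetical.

```python
from typing import Sequence


def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Extent along one axis covered by a single dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stride-1 dilated convolutions applied in series."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (effective kernel extent - 1).
        field += effective_kernel(kernel_size, dilation) - 1
    return field


def standard_conv_weights(channels_in: int, channels_out: int, kernel_size: int) -> int:
    """Weight count of a dense 2D convolution (bias excluded)."""
    return channels_in * channels_out * kernel_size ** 2


def depthwise_conv_weights(channels: int, kernel_size: int) -> int:
    """Weight count of a depth-wise 2D convolution: one kernel per channel."""
    return channels * kernel_size ** 2


# Assumed settings for illustration only (not from the paper).
KERNEL = 3
DILATIONS = (1, 2, 3)
CHANNELS = 64

receptive_field = stacked_receptive_field(KERNEL, DILATIONS)
dense = standard_conv_weights(CHANNELS, CHANNELS, KERNEL)
depthwise = depthwise_conv_weights(CHANNELS, KERNEL)

print("receptive field:", receptive_field)
print("dense weights:", dense)
print("depth-wise weights:", depthwise)
```

Under these assumed settings, three 3x3 layers with growing dilation cover a 13-pixel span per axis, while a depth-wise layer uses 64 times fewer weights than a dense layer of the same width. Dilation does not change the weight count, so the wider context comes at no extra parameter cost.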
