QTSeg: A Query Token-Based Dual-Mix Attention Framework with Multi-Level Feature Distribution for Medical Image Segmentation
Phuong-Nam Tran, Nhat Truong Pham, Duc Ngoc Minh Dang, Eui-Nam Huh, Choong Seon Hong
TL;DR
QTSeg addresses the trade-off between local detail and global context in medical image segmentation with two components: a dual-mix attention decoder (DMA) and a multi-level feature distribution (MLFD) module. Together they let a CNN- or ViT-based encoder pair with a lightweight, query-token-driven mask decoder. The DMA combines a channel attention block (CAB), a spatial attention module (SAM), CTFA, and CFTA to fuse channel and spatial cues with cross-token alignment, while MLFD adaptively redistributes encoder features across decoder stages. On five public datasets covering lesion, polyp, breast cancer, cell, and retinal vessel segmentation, QTSeg achieves state-of-the-art Dice and IoU with fewer FLOPs and parameters than competing methods. The approach offers a practical path toward efficient, accurate medical segmentation suitable for clinical deployment, though small and blurry objects remain challenging.
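For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. It uses the standard definitions (Dice = 2|A∩B| / (|A| + |B|), IoU = |A∩B| / |A∪B|) and is not taken from the QTSeg code base; the function name `dice_and_iou`, the flat 0/1 mask representation, and the empty-mask convention are assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized binary masks.

    Masks are flat sequences of 0/1 values (an illustrative assumption;
    real pipelines typically use 2D or 3D arrays).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels predicted as foreground that are also foreground in the target.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty: scored as a perfect match here (assumed convention).
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy example: each mask has 3 foreground pixels, 2 of them shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")
```

Because Dice counts the intersection twice in the numerator, it is never lower than IoU for the same pair of masks, which is why the two are usually reported together rather than compared with each other.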
Abstract
Medical image segmentation plays a crucial role in assisting healthcare professionals with accurate diagnoses and enabling automated diagnostic processes. Traditional convolutional neural networks (CNNs) often struggle with capturing long-range dependencies, while transformer-based architectures, despite their effectiveness, come with increased computational complexity. Recent efforts have focused on combining CNNs and transformers to balance performance and efficiency, but existing approaches still face challenges in achieving high segmentation accuracy while maintaining low computational costs. Furthermore, many methods underutilize the CNN encoder's capability to capture local spatial information, concentrating primarily on mitigating long-range dependency issues. To address these limitations, we propose QTSeg, a novel architecture for medical image segmentation that effectively integrates local and global information. QTSeg features a dual-mix attention decoder designed to enhance segmentation performance through: (1) a cross-attention mechanism for improved feature alignment, (2) a spatial attention module to capture long-range dependencies, and (3) a channel attention block to learn inter-channel relationships. Additionally, we introduce a multi-level feature distribution module, which adaptively balances feature propagation between the encoder and decoder, further boosting performance. Extensive experiments on five publicly available datasets covering diverse segmentation tasks, including lesion, polyp, breast cancer, cell, and retinal vessel segmentation, demonstrate that QTSeg outperforms state-of-the-art methods across multiple evaluation metrics while maintaining lower computational costs. Our implementation can be found at: https://github.com/tpnam0901/QTSeg (v1.0.0)
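To make the three attention ingredients listed in the abstract more concrete, here is a rough NumPy sketch of generic channel attention, spatial attention, and query-token cross-attention applied to a single feature map. It is not the authors' implementation, which is available at the repository linked above. All function names, tensor shapes, the random weights, and the specific gating formulas are assumptions chosen for brevity; in particular, learned projections, normalization layers, and the MLFD module are omitted, and the spatial gate is a simple per-pixel form rather than a module that models long-range dependencies.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(feat, w):
    """Rescale each channel by a gate computed from global average pooling.

    feat: (C, H, W); w: (C, C) hypothetical gating weights.
    """
    pooled = feat.mean(axis=(1, 2))   # (C,)
    gate = sigmoid(w @ pooled)        # (C,), one weight per channel
    return feat * gate[:, None, None]


def spatial_attention(feat):
    """Rescale each pixel by a gate built from channel-wise mean and max (assumed form)."""
    gate = sigmoid(feat.mean(axis=0) + feat.max(axis=0))  # (H, W)
    return feat * gate[None, :, :]


def cross_attention(tokens, feat):
    """Let query tokens attend to every pixel of the feature map.

    tokens: (N, C); feat: (C, H, W). Returns updated tokens of shape (N, C).
    """
    c = feat.shape[0]
    keys = feat.reshape(c, -1).T                               # (H*W, C)
    weights = softmax(tokens @ keys.T / np.sqrt(c), axis=-1)   # (N, H*W)
    return tokens + weights @ keys                             # residual update


rng = np.random.default_rng(0)
C, H, W, N = 8, 4, 4, 2  # illustrative sizes only
feat = rng.standard_normal((C, H, W))
tokens = rng.standard_normal((N, C))
w = rng.standard_normal((C, C))

refined = spatial_attention(channel_attention(feat, w))
tokens = cross_attention(tokens, refined)

# Each query token produces one mask by a dot product with every pixel feature.
mask_logits = (tokens @ refined.reshape(C, -1)).reshape(N, H, W)
print(mask_logits.shape)
```

The last step reflects the general idea behind query-token mask decoders: each token acts as a learned prototype, and its similarity to every pixel feature yields a per-pixel logit map.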
