QTSeg: A Query Token-Based Dual-Mix Attention Framework with Multi-Level Feature Distribution for Medical Image Segmentation

Phuong-Nam Tran, Nhat Truong Pham, Duc Ngoc Minh Dang, Eui-Nam Huh, Choong Seon Hong

TL;DR

QTSeg addresses the trade-off between local detail and global context in medical image segmentation with two components: a dual-mix attention decoder (DMA) and a multi-level feature distribution (MLFD) module. Together they let a CNN- or ViT-based encoder pair with a lightweight, query-token-driven mask decoder. The DMA combines a channel attention block (CAB), a spatial attention module (SAM), and two cross-attention components (CTFA and CFTA) to fuse channel and spatial cues with cross-token alignment, while MLFD adaptively redistributes encoder features across decoder stages. On five public datasets (lesion, polyp, breast cancer, cell, and retinal vessel segmentation), QTSeg achieves state-of-the-art Dice and IoU with lower FLOPs and parameter counts than competing methods. The approach offers a practical path toward efficient, accurate medical segmentation suitable for clinical deployment, though small and blue objects remain challenging.
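For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. It is not taken from the QTSeg repository: the function name `dice_and_iou`, the flat 0/1 list representation, and the convention of returning 1.0 when both masks are empty are all assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as flat sequences of 0/1.

    Hypothetical helper for illustration; not part of the QTSeg code base.
    """
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy example: 3 predicted pixels, 3 ground-truth pixels, 2 of them shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.4f} IoU={iou:.4f}")
```

For a single pair of masks the two scores are related by Dice = 2·IoU / (1 + IoU), so they rank individual predictions identically. Averages over a dataset can still differ, which is one reason papers typically report both.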

Abstract

Medical image segmentation plays a crucial role in assisting healthcare professionals with accurate diagnoses and enabling automated diagnostic processes. Traditional convolutional neural networks (CNNs) often struggle with capturing long-range dependencies, while transformer-based architectures, despite their effectiveness, come with increased computational complexity. Recent efforts have focused on combining CNNs and transformers to balance performance and efficiency, but existing approaches still face challenges in achieving high segmentation accuracy while maintaining low computational costs. Furthermore, many methods underutilize the CNN encoder's capability to capture local spatial information, concentrating primarily on mitigating long-range dependency issues. To address these limitations, we propose QTSeg, a novel architecture for medical image segmentation that effectively integrates local and global information. QTSeg features a dual-mix attention decoder designed to enhance segmentation performance through: (1) a cross-attention mechanism for improved feature alignment, (2) a spatial attention module to capture long-range dependencies, and (3) a channel attention block to learn inter-channel relationships. Additionally, we introduce a multi-level feature distribution module, which adaptively balances feature propagation between the encoder and decoder, further boosting performance. Extensive experiments on five publicly available datasets covering diverse segmentation tasks, including lesion, polyp, breast cancer, cell, and retinal vessel segmentation, demonstrate that QTSeg outperforms state-of-the-art methods across multiple evaluation metrics while maintaining lower computational costs. Our implementation can be found at: https://github.com/tpnam0901/QTSeg (v1.0.0)
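The abstract mentions a cross-attention mechanism for feature alignment, and the summary describes the mask decoder as query-token-driven. As a rough illustration of that general idea only, the sketch below lets a query token attend over a set of per-pixel feature vectors and then scores each pixel against the updated token to produce a mask. It is a generic scaled dot-product cross-attention without learned projections, not the authors' DMA decoder; the function names, the toy features, and the 0.5 threshold are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, features):
    """Each query token attends over all feature vectors.

    Keys and values are the raw features (no learned projections),
    which is a simplification made for this sketch.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        weights = softmax([dot(q, f) * scale for f in features])
        # Weighted sum of the feature vectors, one output per query token.
        updated.append(
            [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        )
    return updated


# Toy input: three "pixels" with 2-D features and one query token.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
queries = [[4.0, 0.0]]

tokens = cross_attention(queries, features)

# Score every pixel against the updated token; the threshold is arbitrary.
mask = [1 if dot(tokens[0], f) > 0.5 else 0 for f in features]
print(tokens[0], mask)
```

In a real model the queries, keys, and values would pass through learned linear layers and usually several attention heads. The sketch keeps only the attend-then-score structure so the data flow is easy to follow.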

Paper Structure

This paper contains 26 sections, 7 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Comparison of Dice score and FLOPs on the ISIC2016 dataset between QTSeg and other methods. QTSeg outperforms all other methods in IoU score while requiring few FLOPs. Larger circles indicate larger parameter counts.
  • Figure 2: Conceptual comparison of architectures for medical image segmentation. a) The vanilla U-shaped CNN (e.g., UNet). b) The cascaded CNN and transformer architecture used in TransUnet. c) The pure transformer architecture for image segmentation in SwinUnet. d) The hybrid-transformer encoder architecture for efficient image feature extraction in H2Former. e) Our proposed QTSeg architecture.
  • Figure 3: The overall architecture of the proposed method.
  • Figure 4: Detailed illustration of the Dual-Mix Attention Decoder.
  • Figure 5: Visual comparison of predictions from QTSeg and other methods on the ISIC2016 dataset.
  • ...and 4 more figures