QMaxViT-Unet+: A Query-Based MaxViT-Unet with Edge Enhancement for Scribble-Supervised Segmentation of Medical Images

Thien B. Nguyen-Tat, Hoang-An Vo, Phuoc-Sang Dang

TL;DR

This work tackles scribble-supervised medical image segmentation by introducing QMaxViT-Unet+, a MaxViT-based U-Net variant that integrates a query-guided Transformer decoder and an edge enhancement module to compensate for boundary information missing in scribble labels. The architecture replaces conventional encoder/decoder blocks with MaxViT stages, uses a dual-decoder pathway with an auxiliary output, and leverages a loss framework combining scribble, pseudo-label, and edge supervision. Across four public datasets (ACDC, MS-CMRSeg, SUN-SEG, BUSI), the approach achieves high Dice scores and competitive boundary accuracy while reducing annotation costs compared to fully supervised methods; cross-dataset tests show robust generalization. The work also includes extensive ablations and analyses, highlighting the contributions of the Edge module and the query-based decoder to segmentation quality and boundary delineation, and it discusses limitations and future enhancements, including improved edge supervision and unsupervised pre-training opportunities.
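The scribble term of a loss like this is commonly written as a partial cross-entropy that averages only over annotated pixels and skips the unannotated ones. The following is a minimal sketch under that assumption; the function name, the flattened per-pixel layout, and the label id chosen for unannotated pixels are illustrative and are not taken from the paper or its repository.

```python
import math

# Hypothetical label id for pixels the scribble does not cover.
UNANNOTATED = 255


def partial_cross_entropy(probs, scribble, ignore_label=UNANNOTATED):
    """Mean cross-entropy over scribble-annotated pixels only.

    probs:    list of per-pixel class-probability lists (already softmaxed).
    scribble: list of per-pixel label ids, same length as probs.
    """
    losses = [
        -math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        for p, y in zip(probs, scribble)
        if y != ignore_label  # unannotated pixels contribute nothing
    ]
    # With no annotated pixel there is no supervision signal at all.
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
    scribble = [0, UNANNOTATED, 1]
    print(partial_cross_entropy(probs, scribble))
```

Only the first and third pixels enter the average here; the second is ignored no matter what the model predicts for it, which is why scribble supervision alone says little about object boundaries.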

Abstract

The deployment of advanced deep learning models for medical image segmentation is often constrained by the requirement for extensively annotated datasets. Weakly-supervised learning, which allows less precise labels, has become a promising solution to this challenge. Building on this approach, we propose QMaxViT-Unet+, a novel framework for scribble-supervised medical image segmentation. This framework is built on the U-Net architecture, with the encoder and decoder replaced by Multi-Axis Vision Transformer (MaxViT) blocks. These blocks enhance the model's ability to learn local and global features efficiently. Additionally, our approach integrates a query-based Transformer decoder to refine features and an edge enhancement module to compensate for the limited boundary information in the scribble label. We evaluate the proposed QMaxViT-Unet+ on four public datasets focused on cardiac structures, colorectal polyps, and breast cancer: ACDC, MS-CMRSeg, SUN-SEG, and BUSI. Evaluation metrics include the Dice similarity coefficient (DSC) and the 95th percentile of Hausdorff distance (HD95). Experimental results show that QMaxViT-Unet+ achieves 89.1% DSC and 1.316 mm HD95 on ACDC, 88.4% DSC and 2.226 mm HD95 on MS-CMRSeg, 71.4% DSC and 4.996 mm HD95 on SUN-SEG, and 69.4% DSC and 50.122 mm HD95 on BUSI. These results demonstrate that our method outperforms existing approaches in terms of accuracy, robustness, and efficiency while remaining competitive with fully-supervised learning approaches. This makes it ideal for medical image analysis, where high-quality annotations are often scarce and require significant effort and expense. The code is available at: https://github.com/anpc849/QMaxViT-Unet
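For readers unfamiliar with the two metrics, the sketch below computes DSC and HD95 for a pair of small 2D binary masks in plain Python. It is illustrative only and not the authors' evaluation code. It assumes boundaries are foreground pixels with a 4-connected background or out-of-image neighbour, pools the distances from both directions before taking a nearest-rank 95th percentile, and uses a single isotropic `spacing` factor. Published toolkits differ in these details, so values may not match theirs exactly.

```python
import math


def dice(pred, target):
    """Dice similarity coefficient of two binary masks given as lists of 0/1 rows."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * inter / total


def boundary_points(mask):
    """Foreground pixels that touch background or the image border (4-connectivity)."""
    h, w = len(mask), len(mask[0])
    points = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(not (0 <= i < h and 0 <= j < w) or not mask[i][j]
                   for i, j in neighbours):
                points.append((r, c))
    return points


def hd95(pred, target, spacing=1.0):
    """95th-percentile symmetric boundary distance, scaled by pixel spacing."""
    a, b = boundary_points(pred), boundary_points(target)
    if not a or not b:
        raise ValueError("HD95 is undefined when a mask has no foreground")
    # Distance from every boundary point to the nearest point on the other boundary.
    dists = [min(math.dist(p, q) for q in b) for p in a]
    dists += [min(math.dist(q, p) for p in a) for q in b]
    dists.sort()
    rank = max(0, math.ceil(0.95 * len(dists)) - 1)  # nearest-rank percentile
    return spacing * dists[rank]


if __name__ == "__main__":
    pred = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
    target = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
    print(dice(pred, target), hd95(pred, target))
```

In this toy case the two squares overlap by half their area and are offset by one pixel, so DSC is 0.5 and HD95 is one pixel. DSC rewards region overlap, while HD95 penalizes boundary error, which is why the two are reported together.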

Paper Structure

This paper contains 24 sections, 4 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Examples of dense annotations, scribble annotations, and edge information from the ACDC, MS-CMRSeg, SUN-SEG, and BUSI datasets. UP, BG, Polyp, BC, RV, Myo, and LV represent the unannotated, background, colon polyp, breast cancer, right ventricle, myocardium, and left ventricle pixels, respectively.
  • Figure 2: Simple Query enhancer. The edge features extracted from the Edge enhancement module are processed through an adaptive pooling layer and a linear layer. These features are then combined with zero-initialized queries to create improved query representations.
  • Figure 3: QMaxViT-Unet+ architecture. The proposed architecture is based on the U-Net framework, with the conventional U-Net blocks replaced by MaxViT blocks. To improve segmentation accuracy, we incorporate the PPM-FPN module, the Query-guided Transformer decoder, the Edge enhancement module, and the Query enhancer. For readability, skip connections and positional embeddings are omitted from the diagram.
  • Figure 4: Loss Functions
  • Figure 5: Feature visualization of the last E-block before and after refinement by the query-based Transformer decoder. The full feature set consists of 768 features; only a subset is shown for readability.
  • ...and 2 more figures
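
The Query enhancer described in the Figure 2 caption (adaptive pooling, a linear layer, then combination with zero-initialized queries) can be sketched in NumPy as follows. This is a shape-level illustration and not the authors' implementation. It assumes the pooling runs over the flattened spatial positions to produce one token per query, and that "combined" means element-wise addition. The function names, argument layout, and dimensions are all hypothetical.

```python
import numpy as np


def adaptive_avg_pool_tokens(tokens, out_len):
    """Average-pool a (length, channels) token array down to (out_len, channels).

    Bin edges follow the usual adaptive-pooling rule:
    start = floor(i * L / out_len), end = ceil((i + 1) * L / out_len).
    """
    length = tokens.shape[0]
    pooled = np.empty((out_len, tokens.shape[1]), dtype=float)
    for i in range(out_len):
        start = (i * length) // out_len
        end = -((-(i + 1) * length) // out_len)  # ceiling division
        pooled[i] = tokens[start:end].mean(axis=0)
    return pooled


def enhance_queries(edge_feat, weight, bias, num_queries):
    """Build edge-aware queries from an edge feature map.

    edge_feat: (C_edge, H, W) features from the edge enhancement module.
    weight:    (C_edge, D) linear-layer weight; bias: (D,).
    Returns a (num_queries, D) array.
    """
    c, h, w = edge_feat.shape
    tokens = edge_feat.reshape(c, h * w).T                  # (H*W, C_edge)
    pooled = adaptive_avg_pool_tokens(tokens, num_queries)  # (N, C_edge)
    projected = pooled @ weight + bias                      # (N, D)
    queries = np.zeros((num_queries, weight.shape[1]))      # zero-initialized
    return queries + projected                              # assumed: additive


if __name__ == "__main__":
    edge = np.arange(12, dtype=float).reshape(2, 2, 3)
    out = enhance_queries(edge, np.eye(2), np.zeros(2), num_queries=3)
    print(out.shape)
```

Because the queries start at zero, the enhanced queries in this sketch are simply the pooled and projected edge features, which gives the decoder a boundary-aware starting point rather than an uninformative one.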