MIQ-SAM3D: From Single-Point Prompt to Multi-Instance Segmentation via Competitive Query Refinement

Jierui Qu, Jianchun Zhao

TL;DR

This work addresses the need for efficient multi-lesion segmentation in 3D medical images, a scenario where existing SAM-based methods struggle because each point prompt yields only a single object. It introduces MIQ-SAM3D, which converts a single user click into multiple instance queries via a prompt-conditioned instance query generator (PC-IQG) and refines them competitively in a competitive query-optimized decoder (CQRD), on top of a hybrid CNN-Transformer encoder that preserves local boundary detail while modeling global context. Empirical results on LiTS17 and KiTS21 show competitive Dice and NSD scores and strong robustness to prompt variations, with ablations confirming the critical roles of PC-IQG, CQRD, and the dual-branch encoder. The approach offers a practical, end-to-end solution for annotating clinically relevant multi-lesion cases and advances promptable 3D medical image segmentation toward real-world deployment.

Abstract

Accurate segmentation of medical images is fundamental to tumor diagnosis and treatment planning. SAM-based interactive segmentation has gained attention for its strong generalization, but most methods follow a single-point-to-single-object paradigm, which limits multi-lesion segmentation. Moreover, ViT backbones capture global context but often miss high-fidelity local details. We propose MIQ-SAM3D, a multi-instance 3D segmentation framework with a competitive query optimization strategy that shifts from single-point-to-single-mask to single-point-to-multi-instance. A prompt-conditioned instance-query generator transforms a single point prompt into multiple specialized queries, enabling retrieval of all semantically similar lesions across the 3D volume from a single exemplar. A hybrid CNN-Transformer encoder injects CNN-derived boundary saliency into ViT self-attention via spatial gating. A competitively optimized query decoder then enables end-to-end, parallel, multi-instance prediction through inter-query competition. On the LiTS17 and KiTS21 datasets, MIQ-SAM3D achieves competitive Dice and NSD scores and exhibits strong robustness to prompt variations, providing a practical solution for efficient annotation of clinically relevant multi-lesion cases.
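To make the idea of inter-query competition concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's decoder: the abstract only says that queries compete, so the per-voxel softmax across queries, the fixed background logit of 0.0, the flattened-volume layout, and the function name `compete` are all hypothetical choices made for this example.

```python
import math

def compete(query_logits):
    """Assign each voxel to at most one query via a softmax across queries.

    query_logits: list of Q lists, each holding one logit per voxel of a
    flattened 3D volume (hypothetical layout, chosen for illustration).
    Returns, per voxel, the index of the winning query, or -1 for background.
    """
    num_queries = len(query_logits)
    num_voxels = len(query_logits[0])
    labels = []
    for v in range(num_voxels):
        # Assumption: a fixed background logit of 0.0 lets a voxel stay unclaimed.
        logits = [query_logits[q][v] for q in range(num_queries)] + [0.0]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        winner = max(range(len(probs)), key=probs.__getitem__)
        labels.append(winner if winner < num_queries else -1)
    return labels

# Two hypothetical queries scoring four voxels.
logits = [
    [4.0, 3.0, -2.0, -5.0],   # query 0 favors the first two voxels
    [1.0, -1.0, 2.5, -4.0],   # query 1 favors the third voxel
]
labels = compete(logits)
print(labels)
```

Because the softmax normalizes over queries at each voxel, a voxel can belong to only one instance, which is one simple way to keep parallel predictions from overlapping.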

Paper Structure

This paper contains 12 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the proposed MIQ-SAM3D framework. It consists of a hybrid CNN-Transformer encoder, a prompt-conditioned instance query generator (PC-IQG), a competitive query-optimized decoder (CQRD), and an instance prediction head.
  • Figure 2: Structures of the encoder and decoder. (a) Structure of the dual-branch hybrid CNN-Transformer encoder, which uses a spatial gating module to fuse the features of the two branches. (b) Structure of CQRD.
  • Figure 3: Comparison with classical medical image segmentation methods on the LiTS17 and KiTS21 datasets. Dice score and normalized surface Dice (NSD) are reported.
  • Figure 4: Segmentation results under different single-point prompts.
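
For readers unfamiliar with the Dice score reported in Figure 3, the following is a minimal sketch of the standard definition, Dice = 2|P ∩ T| / (|P| + |T|), for a predicted mask P and a ground-truth mask T. The function name `dice_score`, the representation of masks as sets of voxel coordinates, and the empty-mask convention are assumptions made for this illustration; the paper's evaluation code is not shown in the source. NSD, which compares surfaces within a distance tolerance, is not covered here.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two sets of voxel coordinates."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * len(pred & target) / (len(pred) + len(target))

# Toy masks given as (z, y, x) voxel coordinates; two of three voxels overlap.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
target = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
score = dice_score(pred, target)
print(round(score, 4))
```

In this toy case the masks share two voxels and contain three voxels each, so the score is 2·2 / (3 + 3) = 2/3.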