CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation

Jiong Wu, Yang Xing, Boxiao Yu, Wei Shao, Kuang Gong

TL;DR

This work tackles partially labeled multi-dataset medical image segmentation by introducing CDPDNet, a universal framework that combines a self-supervised DINOv2 vision backbone, a CNN encoder, and CLIP text embeddings, guided by task-specific prompts. DINOv2 and CNN features are fused through multi-head cross-attention, CLIP text embeddings are aligned with the fused visual features at multiple scales, and a Text-based Task Prompt Generation (TTPG) module produces task-aware prompts that steer the mask decoder when segmenting organs and tumors. The approach reports state-of-the-art results across 11 CT datasets (32 ROIs) and demonstrates strong generalization to unseen datasets, with an average Dice score of 77.47% and a Hausdorff distance of 17.62. These findings suggest that CDPDNet is a robust, scalable, and generalizable solution for medical image segmentation under partial labeling, with potential impact on multi-center clinical workflows.
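
For readers unfamiliar with the Dice score quoted above, the following is a minimal sketch of how it is typically computed for one binary mask. The function name `dice_score`, the flat 0/1 mask layout, and the convention that two empty masks score 1.0 are assumptions made for illustration; the paper's own evaluation code may differ.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks agree perfectly.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: 3 predicted voxels, 3 reference voxels, 2 shared.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
score = dice_score(pred, target)
print(f"Dice: {score:.4f}")
```

In the toy example the masks share 2 foreground voxels out of 3 + 3, so the score is 2·2/6. A multi-ROI average such as the one reported would be obtained by averaging per-structure scores.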

Abstract

Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2 to extract both fine-grained and global visual features, which were then fused using a multi-head cross-attention module to overcome the limited long-range modeling capability of CNNs. In addition, CLIP-derived text embeddings were projected into the visual space to help model complex relationships among organs and tumors. To further address the partial label challenge and enhance inter-task discriminative capability, a Text-based Task Prompt Generation (TTPG) module that generated task-specific prompts was designed to guide the segmentation. Extensive experiments on multiple medical imaging datasets demonstrated that CDPDNet consistently outperformed existing state-of-the-art segmentation methods. Code and pretrained model are available at: https://github.com/wujiong-hub/CDPDNet.git.
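
The fusion of CNN and DINOv2 features through multi-head cross-attention can be illustrated with a small NumPy sketch. This is not the authors' implementation: the choice of CNN tokens as queries and DINOv2 tokens as keys and values, the residual connection, the token counts, and the names `cross_attention_fuse` and `params` are all assumptions made for illustration, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention_fuse(cnn_tokens, dino_tokens, params, num_heads):
    """Fuse CNN tokens (N_c, D) with DINOv2 tokens (N_d, D).

    Assumption: CNN tokens act as queries, DINOv2 tokens as keys/values.
    Returns the fused tokens (N_c, D) and attention weights (H, N_c, N_d).
    """
    n_c, dim = cnn_tokens.shape
    n_d = dino_tokens.shape[0]
    if dim % num_heads != 0:
        raise ValueError("embedding dim must be divisible by num_heads")
    head_dim = dim // num_heads

    # Linear projections for queries, keys, and values.
    q = cnn_tokens @ params["w_q"]
    k = dino_tokens @ params["w_k"]
    v = dino_tokens @ params["w_v"]

    # Split the channel axis into heads: (H, N, head_dim).
    q = q.reshape(n_c, num_heads, head_dim).transpose(1, 0, 2)
    k = k.reshape(n_d, num_heads, head_dim).transpose(1, 0, 2)
    v = v.reshape(n_d, num_heads, head_dim).transpose(1, 0, 2)

    # Scaled dot-product attention per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    attended = weights @ v

    # Merge heads back to (N_c, D) and apply the output projection.
    merged = attended.transpose(1, 0, 2).reshape(n_c, dim)
    # Assumed residual connection: keep local CNN detail, add global context.
    fused = cnn_tokens + merged @ params["w_o"]
    return fused, weights


rng = np.random.default_rng(0)
dim, num_heads = 32, 4
cnn_tokens = rng.standard_normal((64, dim))   # e.g. a flattened CNN feature map
dino_tokens = rng.standard_normal((16, dim))  # e.g. DINOv2 patch tokens
params = {
    name: rng.standard_normal((dim, dim)) / np.sqrt(dim)
    for name in ("w_q", "w_k", "w_v", "w_o")
}
fused, weights = cross_attention_fuse(cnn_tokens, dino_tokens, params, num_heads)
print(fused.shape, weights.shape)
```

Each CNN location attends over all DINOv2 tokens, which is one way such a module could supply the long-range context that convolutions alone lack.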

Paper Structure

This paper contains 27 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of the proposed CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet). It comprised three main components: a multimodal encoder integrating DINOv2, a CLIP text encoder, and a CNN-based encoder; a Text-based Task Prompt Generation (TTPG) module; and a mask decoder. DINOv2 and the CLIP text encoder extracted dense visual and textual features, respectively. Visual features from DINOv2 and the convolutional blocks were fused using cross-attention modules. Afterward, text features were aligned with the fused visual features using the alignment function $\psi$. A task-specific prompt was generated by the TTPG module and injected into the mask decoder to guide the final segmentation map prediction.
  • Figure 2: Architecture of the proposed Text-based Task Prompt Generation (TTPG) module.
  • Figure 3: (a) Training and testing image composition. (b) Annotated 25 organs, 6 tumors, and a kidney cyst for 11 different segmentation tasks (datasets).
  • Figure 4: Visual comparison of segmentation methods on 5 representative organ segmentation samples from the testing dataset. The first column shows the image, and subsequent columns present results from ground truth and 8 comparison methods.
  • Figure 5: Visual comparison of segmentation methods on 5 representative tumor segmentation samples from the testing dataset. The first column shows the image, and subsequent columns present results from ground truth and 8 comparison methods.
  • ...and 1 more figure