Self-adaptive vision-language model for 3D segmentation of pulmonary artery and vein

Xiaotong Guo, Deqian Yang, Dan Wang, Haochen Zhao, Yuan Li, Zhilin Sui, Tao Zhou, Lijun Zhang, Yanda Meng

TL;DR

The paper addresses automated 3D segmentation of pulmonary arteries and veins in CT with limited labels by introducing a Language-guided self-Adaptive Cross-Attention Fusion Framework that leverages a pre-trained CLIP model with text adapters and a cross-attention fusion mechanism. The method freezes CLIP, employs domain-specific adapters, and fuses language and image embeddings to produce per-class vessel segmentations. It is trained on a large real-world dataset of 718 volumes that includes fully and half-labeled cases. Experiments show substantial gains over competitive baselines: the method achieves a mean Dice similarity coefficient (DSC) of $76.22\%$ on the test set, and its artery and vein segmentation performance surpasses current state-of-the-art methods, with results supported by ablation analyses. The approach reduces the labeling burden while delivering robust segmentation performance, and the dataset and code are slated for public release on acceptance, enhancing reproducibility and clinical relevance.
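For readers unfamiliar with the reported metric, the following is a minimal sketch of how a per-class and mean DSC can be computed on integer label volumes, using the standard definition $DSC = 2|P \cap T| / (|P| + |T|)$. The function names `dice_score` and `mean_dice`, the label encoding, and the convention of returning 1.0 when a class is absent from both volumes are assumptions made for illustration; the paper's own evaluation code may differ.

```python
import numpy as np


def dice_score(pred, target, label):
    """Dice similarity coefficient for one class in two integer label volumes."""
    p = pred == label
    t = target == label
    denom = p.sum() + t.sum()
    if denom == 0:
        # Class absent in both volumes: treated as a perfect match (assumed convention).
        return 1.0
    return float(2.0 * np.logical_and(p, t).sum() / denom)


def mean_dice(pred, target, labels):
    """Average the per-class Dice over the given foreground labels."""
    return float(np.mean([dice_score(pred, target, lab) for lab in labels]))


# Toy 1x2x2 volume; labels are hypothetical: 0 = background, 1 = artery, 2 = vein.
pred = np.array([[[0, 1], [2, 2]]])
target = np.array([[[0, 1], [1, 2]]])
print(mean_dice(pred, target, labels=[1, 2]))
```

In this toy case each class overlaps in one voxel out of three labeled voxels in total, so both per-class scores are 2/3.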

Abstract

Accurate segmentation of pulmonary structures is crucial in clinical diagnosis, disease study, and treatment planning. Significant progress has been made in deep learning-based segmentation techniques, but most require large amounts of labeled data for training. Consequently, developing precise segmentation methods that demand less labeled data is paramount in medical image analysis. The recent emergence of pre-trained vision-language foundation models, such as CLIP, has opened the door to universal computer vision tasks. Exploiting the generalization ability of these pre-trained foundation models on downstream tasks, such as segmentation, can yield unexpectedly strong performance with a relatively small amount of labeled data. However, the use of these models for pulmonary artery-vein segmentation remains largely unexplored. This paper proposes a novel framework called the Language-guided self-adaptive Cross-Attention Fusion Framework. Our method adopts pre-trained CLIP as a strong feature extractor for segmenting 3D CT scans, while adaptively aggregating text and image representations across modalities. We propose a specially designed adapter module to fine-tune pre-trained CLIP with a self-adaptive learning strategy that effectively fuses the embeddings of the two modalities. We extensively validate our method on a local dataset, which is the largest pulmonary artery-vein CT dataset to date and comprises 718 labeled volumes in total. The experiments show that our method outperforms other state-of-the-art methods by a large margin. Our data and code will be made publicly available upon acceptance.
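The general idea of adapting frozen text embeddings and fusing them with image features through cross-attention can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' architecture: the random arrays stand in for frozen CLIP text embeddings and for voxel features from a 3D segmentation backbone, the residual bottleneck adapter with blending ratio `alpha` is a common adapter design swapped in here, and all names, shapes, and the choice of four classes (left/right artery and vein) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adapter(text_emb, w_down, w_up, alpha):
    """Residual bottleneck adapter: only w_down and w_up would be trained."""
    hidden = np.maximum(text_emb @ w_down, 0.0)  # ReLU bottleneck
    adapted = hidden @ w_up
    # Blend adapted and original (frozen) embeddings with ratio alpha.
    return alpha * adapted + (1.0 - alpha) * text_emb


def cross_attention(queries, keys_values):
    """Single-head attention: class queries (K, d) attend to voxel features (N, d)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (K, N)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values  # (K, d)


def segment(text_emb, image_feat, w_down, w_up, alpha=0.5):
    """Fuse text and image embeddings, then score every voxel against every class."""
    t = adapter(text_emb, w_down, w_up, alpha)
    fused = t + cross_attention(t, image_feat)  # image-conditioned class embeddings
    logits = image_feat @ fused.T  # (N, K)
    return softmax(logits, axis=-1)


rng = np.random.default_rng(0)
K, d, r = 4, 16, 4  # classes, embedding width, bottleneck width (all assumed)
D, H, W = 2, 3, 3   # tiny stand-in volume
text_emb = rng.standard_normal((K, d))         # placeholder for frozen CLIP text embeddings
image_feat = rng.standard_normal((D * H * W, d))  # placeholder for backbone voxel features
w_down = rng.standard_normal((d, r)) * 0.1
w_up = rng.standard_normal((r, d)) * 0.1

probs = segment(text_emb, image_feat, w_down, w_up)
label_map = probs.argmax(axis=-1).reshape(D, H, W)
print(label_map.shape)
```

Keeping the text encoder frozen and training only the small adapter matrices is what keeps the number of learned parameters low in this kind of design; setting `alpha` to 0 recovers the unmodified text embeddings.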

Paper Structure (14 sections, 6 equations, 3 figures, 4 tables, 1 algorithm)

Figures (3)

  • Figure 1: Overview of the proposed Language-guided self-Adaptive Cross-Attention Fusion Framework, which comprises a text encoder and an image segmentation model. Our model adaptively learns suitable embeddings for the left/right veins and arteries. Best viewed in color.
  • Figure 2: Illustration of how our data are adapted. Only the artery/vein labels are changed; no other adjustments are made.
  • Figure 3: Visualization of segmentation results on our dataset, with zoomed-in views for clarity. (A-D) The segmentation results of one case in the transverse section, coronal section, sagittal section, and 3D view, respectively. The dashed yellow boxes mark regions misclassified by other models; our method segments these regions closer to the ground truth. Vessel mask colors are standardized to red for arteries and green for veins to make the differences between methods easier to see.