AKGNet: Attribute Knowledge-Guided Unsupervised Lung-Infected Area Segmentation

Qing En, Yuhong Guo

TL;DR

This work tackles unsupervised segmentation of lung-infected regions using image-text pairs without mask annotations. It introduces AKGNet, which combines text attribute knowledge learning, attribute-image cross-attention, and self-training mask refinement to exploit textual descriptions and spatial correlations for segmentation. The method employs a coarse mask from an unsupervised saliency model, a text-based attribute classifier with mask-guided features, and a cross-modal fusion mechanism, all optimized via a joint loss $L_{total}=\lambda_c L_c+\lambda_a L_a+\lambda_{st} L_{st}$. On the QaTa-COV19 dataset, AKGNet achieves state-of-the-art unsupervised segmentation performance (Dice up to around 53.8–55.5 and Jaccard around 41.8–43.7) with competitive parameter efficiency, demonstrating the value of explicit text attribute knowledge and cross-modal reasoning for medical image segmentation without mask annotations.
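
The joint objective above is a weighted sum of three terms. As a minimal sketch, assuming each term has already been reduced to a scalar, it can be written as follows. The function name, the default weights, and the example values are illustrative assumptions and are not taken from the paper.

```python
def total_loss(l_c, l_a, l_st, lam_c=1.0, lam_a=1.0, lam_st=1.0):
    """Weighted sum L_total = lam_c * L_c + lam_a * L_a + lam_st * L_st.

    l_c  : loss between the predicted mask and the coarse mask
    l_a  : text attribute classification loss
    l_st : self-training loss on pseudo-labels
    The default weights of 1.0 are placeholders, not the paper's settings.
    """
    return lam_c * l_c + lam_a * l_a + lam_st * l_st


# Made-up scalar losses and weights, used only to show the call.
example = total_loss(l_c=0.5, l_a=0.2, l_st=0.1, lam_c=1.0, lam_a=0.5, lam_st=2.0)
```

Keeping the three weights as explicit arguments makes it easy to sweep each one separately, which is the kind of sensitivity study the paper reports for $\lambda_c$, $\lambda_a$ and $\lambda_{st}$.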

Abstract

Lung-infected area segmentation is crucial for assessing the severity of lung diseases. However, existing image-text multi-modal methods typically rely on labour-intensive annotations for model training, posing challenges regarding time and expertise. To address this issue, we propose a novel attribute knowledge-guided framework for unsupervised lung-infected area segmentation (AKGNet), which achieves segmentation solely based on image-text data without any mask annotation. AKGNet facilitates text attribute knowledge learning, attribute-image cross-attention fusion, and high-confidence-based pseudo-label exploration simultaneously. It can learn statistical information and capture spatial correlations between image and text attributes in the embedding space, iteratively refining the mask to enhance segmentation. Specifically, we introduce a text attribute knowledge learning module by extracting attribute knowledge and incorporating it into feature representations, enabling the model to learn statistical information and adapt to different attributes. Moreover, we devise an attribute-image cross-attention module by calculating the correlation between attributes and images in the embedding space to capture spatial dependency information, thus selectively focusing on relevant regions while filtering irrelevant areas. Finally, a self-training mask improvement process is employed by generating pseudo-labels using high-confidence predictions to iteratively enhance the mask and segmentation. Experimental results on a benchmark medical image dataset demonstrate the superior performance of our method compared to state-of-the-art segmentation techniques in unsupervised scenarios.
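
To make the attribute-image cross-attention idea concrete, here is a minimal sketch in plain Python. It assumes that image locations act as queries and attribute embeddings act as keys and values, that similarity is a scaled dot product, and that the attended attribute context is added back to the image feature. None of these choices is confirmed by the abstract, and the learned projections a real module would use are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, attr_embeds):
    """Fuse attribute embeddings into per-location image features.

    image_feats : list of N vectors, one per spatial location (queries)
    attr_embeds : list of K vectors, one per text attribute (keys and values)
    Returns (fused, weights): fused has N vectors, weights has N rows of K.
    """
    d = len(attr_embeds[0])
    scale = math.sqrt(d)
    fused, weights = [], []
    for q in image_feats:
        # Correlation between this location and every attribute.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in attr_embeds]
        w = softmax(scores)
        # Attribute context: attention-weighted sum of attribute embeddings.
        context = [sum(w[j] * attr_embeds[j][c] for j in range(len(attr_embeds)))
                   for c in range(d)]
        # Residual fusion (an assumption made for this sketch).
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(w)
    return fused, weights


# Toy inputs: two locations, two attributes, embedding size 2.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
attr_embeds = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_attention(image_feats, attr_embeds)
```

In this toy case each location attends more strongly to the attribute it is aligned with, which is the "focus on relevant regions, filter irrelevant areas" behaviour the abstract describes.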

Paper Structure (26 sections, 11 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Illustration of the proposed idea. (a) Existing methods require mask annotations to train a model for image-text lung infection region segmentation. (b) Our proposed method performs unsupervised image-text lung infection region segmentation without mask annotations by mining text attribute knowledge to learn statistical information.
  • Figure 2: An overview of the proposed AKGNet. First, a coarse mask is generated. Next, text attribute knowledge is extracted from the text descriptions to construct the training target for the text attribute classifier. Then, the mask-guided image features extracted by the image encoder are fed into the classifier to compute $L_{a}$. The attribute-image fusion features generated by the attribute-image cross-attention (AICA) module are fed into the image decoder to generate a prediction mask, which is compared with the coarse mask to compute $L_{c}$. Finally, the self-training mask refinement process is implemented by computing $L_{st}$.
  • Figure 3: Ablation study: (a) impact of different attributes in $L_{a}$; (b) impact of using mask-guided intermediate features versus the original intermediate features; (c) impact of the self-training threshold $\delta$. We report the Dice values.
  • Figure 4: Impact of the weight of different losses (a) $\lambda_{c}$; (b) $\lambda_{a}$ and (c) $\lambda_{st}$. We report the Dice values.
  • Figure 5: Visualization examples of different methods. The left two columns represent input images and ground-truths. The rightmost column represents the segmentation results of our proposed AKGNet, and the remaining two columns represent the results of UNet and LViT.
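
The figures above refer to a self-training threshold $\delta$ and report Dice values. The sketch below shows one way both could look in code, working on flattened lists of pixels. The two-sided confidence rule, the default `delta=0.9`, and the function names are assumptions made for illustration; the paper's exact pseudo-label rule and threshold value are not given in this summary.

```python
def pseudo_labels(probs, delta=0.9):
    """Turn foreground probabilities into pseudo-labels.

    Confident foreground (p >= delta) becomes 1, confident background
    (p <= 1 - delta) becomes 0, and everything else becomes None so it
    can be ignored by the self-training loss. This rule is an assumption.
    """
    labels = []
    for p in probs:
        if p >= delta:
            labels.append(1)
        elif p <= 1 - delta:
            labels.append(0)
        else:
            labels.append(None)
    return labels


def dice_and_jaccard(pred, target):
    """Dice and Jaccard for two binary masks given as flat lists of 0/1."""
    inter = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    pred_sum = sum(pred)
    target_sum = sum(target)
    union = pred_sum + target_sum - inter
    # Two empty masks are treated as a perfect match.
    dice = 2 * inter / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    jaccard = inter / union if union else 1.0
    return dice, jaccard


# Toy values, not results from the paper.
labels = pseudo_labels([0.95, 0.6, 0.05, 0.4], delta=0.9)
dice, jaccard = dice_and_jaccard([1, 1, 0, 0], [1, 0, 1, 0])
```

A higher `delta` keeps fewer but more reliable pixels as pseudo-labels, while a lower one keeps more pixels at the cost of noisier supervision. The metrics here are fractions in [0, 1], so multiply by 100 to compare with percentage-style scores.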