AKGNet: Attribute Knowledge-Guided Unsupervised Lung-Infected Area Segmentation
Qing En, Yuhong Guo
TL;DR
This work tackles unsupervised segmentation of lung-infected regions using image-text pairs without mask annotations. It introduces AKGNet, which combines text attribute knowledge learning, attribute-image cross-attention, and self-training mask refinement to exploit textual descriptions and spatial correlations for segmentation. The method uses a coarse mask from an unsupervised saliency model, a text-based attribute classifier with mask-guided features, and a cross-modal fusion mechanism, all optimized via a joint loss $L_{total}=\lambda_c L_c+\lambda_a L_a+\lambda_{st} L_{st}$. On the QaTa-COV19 dataset, AKGNet achieves state-of-the-art unsupervised segmentation performance (Dice of roughly 53.8–55.5 and Jaccard of roughly 41.8–43.7) with competitive parameter efficiency, demonstrating the value of explicit text attribute knowledge and cross-modal reasoning for medical image segmentation without mask annotations.
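The joint objective above is a weighted sum of three scalar loss terms. The following minimal Python sketch shows only that combination. The function name `total_loss` and the default weights of 1.0 are illustrative assumptions; the summary does not state the weight values, and reading the subscripts as coarse-mask, attribute, and self-training terms is an interpretation of the description above rather than a confirmed definition.

```python
def total_loss(l_c, l_a, l_st, lambda_c=1.0, lambda_a=1.0, lambda_st=1.0):
    """Combine three scalar losses as L_total = λ_c·L_c + λ_a·L_a + λ_st·L_st.

    The default weights are placeholders (assumption), not values from the paper.
    """
    return lambda_c * l_c + lambda_a * l_a + lambda_st * l_st


# Hypothetical loss values and weights, chosen only to make the arithmetic visible.
example = total_loss(0.5, 0.25, 0.125, lambda_c=1.0, lambda_a=2.0, lambda_st=4.0)
```

With these hypothetical inputs each weighted term contributes 0.5, so the weights can be read as controlling how much each component pulls on the shared parameters.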
Abstract
Lung-infected area segmentation is crucial for assessing the severity of lung diseases. However, existing image-text multi-modal methods typically rely on labour-intensive annotations for model training, which demand considerable time and expertise. To address this issue, we propose a novel attribute knowledge-guided framework for unsupervised lung-infected area segmentation (AKGNet), which performs segmentation based solely on image-text data without any mask annotation. AKGNet jointly performs text attribute knowledge learning, attribute-image cross-attention fusion, and high-confidence-based pseudo-label exploration. It learns statistical information and captures spatial correlations between image and text attributes in the embedding space, iteratively refining the mask to enhance segmentation. Specifically, we introduce a text attribute knowledge learning module that extracts attribute knowledge and incorporates it into feature representations, enabling the model to learn statistical information and adapt to different attributes. Moreover, we devise an attribute-image cross-attention module that calculates the correlation between attributes and images in the embedding space to capture spatial dependency information, selectively focusing on relevant regions while filtering out irrelevant areas. Finally, a self-training mask improvement process generates pseudo-labels from high-confidence predictions to iteratively enhance the mask and segmentation. Experimental results on a benchmark medical image dataset demonstrate the superior performance of our method compared to state-of-the-art segmentation techniques in unsupervised scenarios.
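Two of the mechanisms described above, attribute-image cross-attention and high-confidence pseudo-labelling, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes standard scaled dot-product attention with image features as queries and attribute embeddings as keys and values, a residual fusion step, and fixed confidence thresholds of 0.9 and 0.1. All function names, shapes, and thresholds are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attribute_image_cross_attention(image_feats, attr_feats):
    """Fuse attribute embeddings into image features via cross-attention.

    image_feats: (N, d) flattened pixel or patch features, used as queries.
    attr_feats:  (K, d) text attribute embeddings, used as keys and values.
    Returns the fused (N, d) features and the (N, K) attention weights.
    """
    d = image_feats.shape[-1]
    # Correlation between every image location and every attribute.
    scores = image_feats @ attr_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Residual fusion is an assumption made for this sketch.
    fused = image_feats + weights @ attr_feats
    return fused, weights


def high_confidence_pseudo_labels(prob, high=0.9, low=0.1, ignore=-1):
    """Turn foreground probabilities into pseudo-labels.

    Pixels with prob >= high become 1 (infected), prob <= low become 0
    (background), and everything in between is marked `ignore` so that
    it can be excluded from the self-training loss.
    The thresholds are illustrative assumptions.
    """
    labels = np.full(prob.shape, ignore, dtype=np.int64)
    labels[prob >= high] = 1
    labels[prob <= low] = 0
    return labels


# Toy usage with random features: 6 image locations, 3 attributes, d = 4.
rng = np.random.default_rng(0)
fused, weights = attribute_image_cross_attention(
    rng.normal(size=(6, 4)), rng.normal(size=(3, 4))
)
pseudo = high_confidence_pseudo_labels(np.array([0.95, 0.5, 0.05]))
```

In a self-training loop along these lines, the pseudo-labels would be regenerated from the current predictions at each round, so the set of confidently labelled pixels can grow as the mask improves.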
