CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention
Yaxiong Chen, Minghong Wei, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
TL;DR
CausalCLIPSeg tackles referring medical image segmentation by combining CLIP-based text and vision encoders with a cross-modal decoder and a causal intervention module that suppresses confounding features. The model uses a structural causal model and adversarial masking to separate causal lesion cues from spurious context, and it is trained with a min-max objective. On QaTa-COV19, it achieves state-of-the-art Dice and mIoU scores, outperforming vision-only and prior multi-modal methods, and ablation studies confirm the value of both CLIP pretraining and the causal module. The work demonstrates the potential of transferring large-scale vision-language priors to medical segmentation while addressing spurious correlations, with implications for robust multi-modal medical analysis.
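For readers unfamiliar with the two reported metrics, the following is a minimal sketch of Dice and IoU for binary masks, with mIoU taken as the mean IoU over samples. The function names `dice_and_iou` and `mean_iou`, the smoothing term `eps`, and the per-sample averaging are illustrative assumptions; the exact evaluation protocol is not stated in this summary.

```python
def dice_and_iou(pred, target, eps=1e-7):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1.

    `eps` is an assumed smoothing term that avoids division by zero when
    both masks are empty (in which case both scores evaluate to 1.0).
    """
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("pred and target must have the same number of pixels")

    intersection = sum(a * b for a, b in zip(p, t))
    pred_area, target_area = sum(p), sum(t)
    union = pred_area + target_area - intersection

    dice = (2 * intersection + eps) / (pred_area + target_area + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou


def mean_iou(pairs):
    """Average IoU over an iterable of (pred, target) mask pairs."""
    scores = [dice_and_iou(pred, target)[1] for pred, target in pairs]
    return sum(scores) / len(scores)
```

Dice weights the overlap twice relative to the two mask areas, so for the same prediction it is never lower than IoU; this is why papers often report both.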
Abstract
Referring medical image segmentation aims to delineate lesions indicated by textual descriptions. Aligning visual and textual cues is challenging because the two modalities have distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Although CLIP is not trained on medical data, we transfer its rich semantic space to the medical domain through a tailored cross-modal decoding method that achieves text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module that self-annotates confounders and extracts causal features from the inputs for segmentation decisions. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
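The idea of text-to-pixel alignment can be illustrated with a small sketch: each pixel's feature vector is compared with a single text embedding, and pixels that match the text are marked as foreground. This is a simplified illustration, not the authors' cross-modal decoder. The function name `text_to_pixel_mask`, the cosine-similarity scoring, and the `temperature` and `threshold` parameters are all assumptions made for the example.

```python
import math


def text_to_pixel_mask(pixel_feats, text_emb, threshold=0.5, temperature=0.1):
    """Score every pixel feature against one text embedding.

    pixel_feats: H x W x D nested lists of floats (assumed decoder output).
    text_emb:    length-D list of floats (assumed sentence-level embedding).
    Returns an H x W binary mask as nested lists of 0/1.
    """

    def normalize(vec):
        # L2-normalize; fall back to 1.0 to avoid dividing by zero.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    text = normalize(text_emb)
    mask = []
    for row in pixel_feats:
        mask_row = []
        for feat in row:
            # Cosine similarity between this pixel and the text.
            sim = sum(a * b for a, b in zip(normalize(feat), text))
            # Sigmoid with an assumed temperature maps similarity to (0, 1).
            prob = 1.0 / (1.0 + math.exp(-sim / temperature))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask
```

In a trained model the pixel features and the text embedding would come from learned encoders and the similarity map would be supervised with the ground-truth mask; here they are plain lists so that the scoring step can be read in isolation.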
