VOILA: Complexity-Aware Universal Segmentation of CT images by Voxel Interacting with Language
Zishuo Wan, Yu Gao, Wanyuan Pang, Dawei Ding
TL;DR
VOILA addresses the challenge of universal CT segmentation by aligning per-voxel representations with language in a shared space using cosine similarity, and by mitigating class imbalance and computational cost through a Voxel-Language Interaction framework and a Complexity-Aware Self-Supervised Sampling module. A voxel-centric approach pairs a CLIP-based text encoder with a voxel encoder, enriched prompts, and a CAS mechanism that concentrates learning on hard-to-segment regions via a CVAE-generated complexity heatmap, reducing the need for large fully connected classifiers. The method achieves competitive performance across seven public datasets, particularly excelling as the number of classes grows, while requiring fewer parameters and training resources and demonstrating strong generalization without fine-tuning. The work advances practical universal segmentation for CT imaging by combining voxel-level contrastive learning with cross-modal prompts and self-supervised hard-sample mining, enabling scalable, data-efficient segmentation across diverse datasets.
Abstract
Satisfactory progress has been achieved recently in universal segmentation of CT images. Following the success of vision-language methods, there is a growing trend towards utilizing text prompts and contrastive learning to develop universal segmentation models. However, there exists a significant imbalance in information density between 3D images and text prompts. Moreover, the standard fully connected layer segmentation approach faces significant challenges in handling multiple classes and exhibits poor generalizability. To address these challenges, we propose the VOxel Interacting with LAnguage method (VOILA) for universal CT image segmentation. First, we align voxels and language in a shared representation space and classify voxels on the basis of cosine similarity. Next, we develop the Voxel-Language Interaction framework to mitigate the impact of class imbalance caused by foreground-background discrepancies and variations in target volumes. Furthermore, a Complexity-Aware Sampling method is proposed to focus on regions that are hard to segment, achieved by generating pseudo-heatmaps from a trainable Gaussian mixture distribution. Our results indicate that the proposed VOILA achieves improved performance with fewer parameters and lower computational cost during training. It also demonstrates significant generalizability across diverse datasets without additional fine-tuning.
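To make the idea of classifying voxels by cosine similarity concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function name `classify_voxels`, the array shapes, and the toy data are all assumptions chosen for illustration; in practice the voxel features would come from a voxel encoder and the class embeddings from a text encoder.

```python
import numpy as np


def classify_voxels(voxel_feats, text_embeds, eps=1e-8):
    """Assign each voxel to the class whose text embedding is most similar.

    voxel_feats: (D, H, W, C) per-voxel features (hypothetical encoder output).
    text_embeds: (K, C) one embedding per class prompt (hypothetical).
    Returns (labels, sims) with shapes (D, H, W) and (D, H, W, K).
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=-1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    sims = v @ t.T  # (D, H, W, K)
    return sims.argmax(axis=-1), sims


# Toy data: three classes with orthogonal embeddings of different magnitudes.
rng = np.random.default_rng(0)
text_embeds = np.eye(3, 16) * np.array([[2.0], [0.5], [7.0]])
true_labels = rng.integers(0, 3, size=(4, 4, 4))
# Each voxel feature is a scaled, slightly noisy copy of its class direction.
directions = np.eye(3, 16)[true_labels]
voxel_feats = 5.0 * directions + 0.01 * rng.normal(size=(4, 4, 4, 16))

labels, sims = classify_voxels(voxel_feats, text_embeds)
```

Because both sides are normalised, the prediction depends only on the direction of each embedding, not its magnitude. Adding a class in this sketch means adding a row to `text_embeds` rather than resizing a fully connected output layer.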
