VOILA: Complexity-Aware Universal Segmentation of CT images by Voxel Interacting with Language

Zishuo Wan, Yu Gao, Wanyuan Pang, Dawei Ding

TL;DR

VOILA tackles universal CT segmentation by aligning per-voxel representations with language in a shared space and classifying voxels by cosine similarity. It mitigates class imbalance and computational cost through a Voxel-Language Interaction framework and a Complexity-Aware Self-Supervised Sampling (CAS) module. The voxel-centric design pairs a CLIP-based text encoder with a voxel encoder and enriched prompts, while the CAS module concentrates learning on hard-to-segment regions via a CVAE-generated complexity heatmap; together these reduce the need for large fully connected classifiers. The method achieves competitive performance across seven public datasets, with its advantage most visible as the number of classes grows, while requiring fewer parameters and training resources and generalizing well without fine-tuning. By combining voxel-level contrastive learning with cross-modal prompts and self-supervised hard-sample mining, the work moves toward scalable, data-efficient universal segmentation for CT imaging.

Abstract

Satisfactory progress has recently been achieved in universal segmentation of CT images. Following the success of vision-language methods, there is a growing trend towards utilizing text prompts and contrastive learning to develop universal segmentation models. However, there exists a significant imbalance in information density between 3D images and text prompts. Moreover, the standard fully connected layer segmentation approach faces significant challenges in handling multiple classes and exhibits poor generalizability. To address these challenges, we propose the VOxel Interacting with LAnguage method (VOILA) for universal CT image segmentation. First, we align voxels and language in a shared representation space and classify voxels on the basis of cosine similarity. Second, we develop the Voxel-Language Interaction framework to mitigate the impact of class imbalance caused by foreground-background discrepancies and variations in target volumes. Third, we propose a Complexity-Aware Sampling method that focuses on regions that are hard to segment, achieved by generating pseudo-heatmaps from a trainable Gaussian mixture distribution. Our results indicate that VOILA achieves improved performance with fewer parameters and lower computational cost during training. Furthermore, it demonstrates significant generalizability across diverse datasets without additional fine-tuning.
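
The classification step described in the abstract (assigning each voxel to a class by cosine similarity against text embeddings) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `classify_voxels`, the array shapes, and the toy embeddings are assumptions made purely for illustration, and the real encoders are replaced by hand-written vectors.

```python
import numpy as np

def classify_voxels(voxel_emb, text_emb, eps=1e-8):
    """Assign each voxel to the class whose text embedding is most similar.

    voxel_emb: (N, D) array, one embedding per voxel (assumed layout).
    text_emb:  (C, D) array, one embedding per class prompt (assumed layout).
    Returns (labels, sims): labels has shape (N,), sims has shape (N, C).
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    v = voxel_emb / (np.linalg.norm(voxel_emb, axis=1, keepdims=True) + eps)
    t = text_emb / (np.linalg.norm(text_emb, axis=1, keepdims=True) + eps)
    sims = v @ t.T
    return sims.argmax(axis=1), sims

# Toy stand-ins for encoder outputs: 2 class prompts, 3 voxels, D = 3.
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
voxel_emb = np.array([[2.0, 0.1, 0.0],
                      [0.2, 3.0, 0.0],
                      [0.9, 1.0, 0.0]])

labels, sims = classify_voxels(voxel_emb, text_emb)
print(labels)  # class index per voxel
```

In a sketch like this, supporting an extra class only means adding a row to `text_emb`; no classifier weights are tied to a fixed number of outputs, which is one way to read the abstract's contrast with a fully connected segmentation head.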

Paper Structure (24 sections, 6 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Overview of VOILA. (a) The overall workflow of VOILA. Given CT images and text prompts as inputs, the respective encoders extract their representation tokens. Voxels are selected by (b) the Complexity-Aware Sampling module. Finally, the tokens interact across modalities in (c) the Voxel-Language Interaction module for classification.
  • Figure 2: Cosine similarities of text tokens extracted by the text encoder. The additional text prompts in this paper include more cross-text interactions.
  • Figure 3: (a) The pseudo-heatmap generated by the CVAE in the Complexity-Aware Sampling (CAS) module. (b)-(f) Voxels sampled by the CAS module at different sampling rates.
  • Figure 4: Example heatmaps generated by the CVAE in the CAS module. (a) The ground-truth label. Heatmaps (b)-(f) are selected sequentially throughout the entire training process. Over the course of training, the sampling pattern begins as randomly scattered points, gradually aggregates at key locations, and then disperses into finer details.
  • Figure 5: Visual comparison of three methods on TotalSegmentator-v2.
  • ...and 1 more figures
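
The heatmap-driven sampling shown in Figures 3 and 4 can be sketched as follows. The exact sampling rule is not given in this summary, so the code below assumes the simplest plausible one: voxels are drawn without replacement with probability proportional to their heatmap value. The function name `sample_voxels`, the `rate` parameter, and the toy heatmap are all hypothetical; in VOILA the heatmap is produced by the CVAE rather than written by hand.

```python
import numpy as np

def sample_voxels(heatmap, rate, rng, eps=1e-6):
    """Draw a subset of voxel coordinates, favouring high-heatmap voxels.

    heatmap: non-negative array of any shape (e.g. D x H x W).
    rate:    fraction of voxels to keep, in (0, 1].
    Returns an integer array of shape (k, heatmap.ndim) with unique coordinates.
    Assumption: sampling probability is proportional to the heatmap value.
    """
    # eps keeps every probability non-zero so sampling without replacement works.
    flat = heatmap.ravel().astype(np.float64) + eps
    probs = flat / flat.sum()
    k = max(1, int(round(rate * flat.size)))
    idx = rng.choice(flat.size, size=k, replace=False, p=probs)
    return np.stack(np.unravel_index(idx, heatmap.shape), axis=1)

# Toy 4x4x4 heatmap: low complexity everywhere except one 2x2x2 "hard" block.
heatmap = np.full((4, 4, 4), 0.01)
heatmap[1:3, 1:3, 1:3] = 1.0

rng = np.random.default_rng(0)
coords = sample_voxels(heatmap, rate=0.125, rng=rng)
print(coords.shape)  # (number of sampled voxels, 3)
```

Lowering `rate` keeps fewer voxels per step, which is where the training-cost saving would come from under this assumption; the high-valued block is sampled far more often than the background.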