Universal and Extensible Language-Vision Models for Organ Segmentation and Tumor Detection from Abdominal Computed Tomography

Jie Liu; Yixiao Zhang; Kang Wang; Mehmet Can Yavuz; Xiaoxi Chen; Yixuan Yuan; Haoliang Li; Yang Yang; Alan Yuille; Yucheng Tang; Zongwei Zhou

TL;DR

The work tackles scalable organ segmentation and tumor detection across partially labeled abdominal CT datasets by introducing a universal model that couples a vision backbone with a language-driven parameter generator. The CLIP-Driven Universal Model uses language embeddings to generate per-class parameters for lightweight class-specific heads, enabling simultaneous segmentation of 25 organs and 6 tumor types with strong generalization and continual-learning capabilities. It achieves leading performance on the MSD and BTCV benchmarks, remains efficient (≈6x faster than dataset-specific models), and extends to new classes with minimal parameter growth, using pseudo-labeling to mitigate forgetting. In practice, this could reduce annotation burden in multi-dataset medical imaging and enable rapid adaptation to new anatomical structures and pathologies across institutions.
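To make the idea of training on partially labeled datasets concrete, here is a minimal sketch of one common approach: compute the loss only over the classes that a given dataset actually annotates. It assumes independent per-class sigmoid outputs and a binary cross-entropy loss; the function name `masked_bce`, the class list, and the numbers are hypothetical and are not taken from the paper.

```python
import math

def masked_bce(probs, targets, labeled):
    """Mean binary cross-entropy over annotated classes only.

    probs   -- predicted foreground probabilities, one per class
    targets -- 0/1 ground truth per class (ignored where unlabeled)
    labeled -- True where the source dataset annotates that class
    """
    eps = 1e-7  # clamp probabilities so log() stays finite
    total, count = 0.0, 0
    for p, t, m in zip(probs, targets, labeled):
        if not m:
            continue  # class not annotated in this dataset: contributes no loss
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0

# Hypothetical voxel from a dataset that labels only liver and liver tumor.
classes = ["liver", "kidney", "liver tumor"]
probs = [0.9, 0.2, 0.6]
targets = [1, 0, 1]
labeled = [True, False, True]
loss = masked_bce(probs, targets, labeled)

# Changing the prediction for the unlabeled class leaves the loss untouched.
loss_other = masked_bce([0.9, 0.99, 0.6], targets, labeled)
```

Because the kidney entry is masked out, a missing kidney annotation is never treated as "background", which is the failure mode that plain multi-class training on merged datasets would run into.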

Abstract

The advancement of artificial intelligence (AI) for organ segmentation and tumor detection is propelled by the growing availability of computed tomography (CT) datasets with detailed, per-voxel annotations. However, these AI models often struggle with flexibility for partially annotated datasets and extensibility for new classes due to limitations in the one-hot encoding, architectural design, and learning scheme. To overcome these limitations, we propose a universal, extensible framework enabling a single model, termed Universal Model, to deal with multiple public datasets and adapt to new classes (e.g., organs/tumors). First, we introduce a novel language-driven parameter generator that leverages language embeddings from large language models, enriching semantic encoding compared with one-hot encoding. Second, the conventional output layers are replaced with lightweight, class-specific heads, allowing Universal Model to simultaneously segment 25 organs and six types of tumors and easing the addition of new classes. We train our Universal Model on 3,410 CT volumes assembled from 14 publicly available datasets and then test it on 6,173 CT volumes from four external datasets. Universal Model achieves first place on six CT tasks in the Medical Segmentation Decathlon (MSD) public leaderboard and leading performance on the Beyond The Cranial Vault (BTCV) dataset. In summary, Universal Model exhibits remarkable computational efficiency (6x faster than other dataset-specific models), demonstrates strong generalization across different hospitals, transfers well to numerous downstream tasks, and, more importantly, facilitates extensibility to new classes while alleviating the catastrophic forgetting of previously learned classes. Code, models, and datasets are available at https://github.com/ljwztc/CLIP-Driven-Universal-Model
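The two design ideas in the abstract (a language-driven parameter generator and lightweight class-specific heads) can be illustrated with a toy sketch. This is not the authors' implementation: the text embeddings are random stand-ins for real language-model embeddings, a single linear layer stands in for each per-class generator, and all names and sizes (`UniversalHeads`, `EMBED_DIM`, `FEAT_DIM`, `fake_text_embedding`) are hypothetical.

```python
import math
import random

random.seed(0)

EMBED_DIM = 8  # hypothetical text-embedding size
FEAT_DIM = 4   # hypothetical per-voxel feature size from the vision backbone

def make_linear(n_in, n_out):
    """Random weights and zero biases for one linear layer."""
    w = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def linear(layer, x):
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def fake_text_embedding():
    """Placeholder for a language-model embedding of a class prompt."""
    return [random.uniform(-1, 1) for _ in range(EMBED_DIM)]

class UniversalHeads:
    def __init__(self):
        self.embeddings = {}  # class name -> text embedding
        self.generators = {}  # class name -> that class's own generator

    def add_class(self, name, text_embedding):
        self.embeddings[name] = text_embedding
        # The generator outputs FEAT_DIM weights plus 1 bias for this class's head.
        self.generators[name] = make_linear(EMBED_DIM, FEAT_DIM + 1)

    def predict(self, voxel_feature):
        out = {}
        for name, emb in self.embeddings.items():
            params = linear(self.generators[name], emb)
            weights, bias = params[:FEAT_DIM], params[FEAT_DIM]
            logit = sum(w * f for w, f in zip(weights, voxel_feature)) + bias
            out[name] = 1.0 / (1.0 + math.exp(-logit))  # independent sigmoid per class
        return out

model = UniversalHeads()
model.add_class("liver", fake_text_embedding())
model.add_class("liver tumor", fake_text_embedding())
voxel = [0.3, -1.2, 0.8, 0.1]  # hypothetical backbone feature for one voxel
before = model.predict(voxel)

# Extending to a new class adds one embedding and one small generator.
model.add_class("pancreas", fake_text_embedding())
after = model.predict(voxel)
```

In this toy setup, adding "pancreas" leaves the outputs for "liver" and "liver tumor" unchanged because each class owns its generator and head. In a real model the shared backbone would also be updated during extension, which is where the forgetting mitigation described in the abstract comes in.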

Paper Structure (39 sections, 5 equations, 7 figures, 11 tables)

Introduction
Related Work
Organ Segmentation and Tumor Detection
Large Language Vision Model
Incremental Learning
Learning with Integrated Datasets
Our Previous Work
Methodology
Problem Definition
CLIP-Driven Universal Model
Backbone Network
Language-driven Parameter Generator
Class-specific Segment Head
Optimization
Extended to Novel Classes
...and 24 more sections

Figures (7)

Figure 1: Overview. We developed the continual CLIP-Driven Universal Model from an assembly of 14 public datasets comprising 3,410 CT volumes. In total, 25 organs and 6 types of tumors are partially labeled. To handle partial labels, Universal Model consists of a language branch and a vision branch. The official test sets of MSD and BTCV are used to benchmark the performance of organ segmentation and tumor detection. 3D-IRCADb, TotalSegmentator, and a large-scale private dataset, consisting of 5,038 CT volumes with 21 annotated organs, are used for independent, external validation of model generalizability and transferability. The language-driven parameter generator (LPG) module uses a separate MLP for each organ to overcome the entanglement issue present in the ICCV version [liu2023clip], which relied on a single MLP.
Figure 2: Qualitative results of multi-tumor detection and segmentation. We review the detection/segmentation results of each tumor type from smaller to larger sizes. Notably, Universal Model generalizes well in organ segmentation and produces few false-positive tumor predictions on a tumor-free CT volume from another hospital (Row 3).
Figure 3: Ablation study on different segmentation backbones. Universal Model can be built on Transformer-based (Swin UNETR) and CNN-based (U-Net, SegResNet-Tiny) backbones, which achieve comparable results. The parameter counts of Swin UNETR, U-Net, and SegResNet-Tiny are 62.19M, 19.08M, and 4.7M, respectively. The order of classes is the same as in the label index table.
Figure 4: (a) Pseudo-label visualization. 25 organs and 6 tumors in four unseen datasets are visualized. (b) External validation for liver tumor detection. In Cases 1 and 2, Universal Model successfully identified small new liver tumors that had been overlooked during radiological evaluation. In Cases 3 and 4, where multiple liver tumors were present, Universal Model detected them, resulting in improved diagnostic efficiency.
Figure 5: (a) Contour line comparison between pseudo labels and two human experts on example CT volumes. The red line is the annotation from Doctor 1, the green line is the annotation from Doctor 2, and the blue line is the result generated by Universal Model. The model's predictions for these organs are comparable with those of the human experts. (b) Intra-observer variability. Pseudo labels generated by Universal Model (AI) and annotations by two human experts (Dr1,2) perform similarly on 6 organs. Spleen (Spl), liver (Liv), kidneys (Kid), stomach (Sto), gallbladder (Gall), and pancreas (Pan) can be annotated by AI with a similar intra-observer variability to humans.
...and 2 more figures
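
The agreement comparison described in the Figure 5 caption (AI pseudo labels versus two human experts) can be quantified with a pairwise overlap score. The sketch below assumes the Dice similarity coefficient as that score, which the caption itself does not state; the tiny 2D masks and the names `dr1`, `dr2`, and `ai` are invented for illustration.

```python
from itertools import combinations

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two masks given as sets of voxel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # both empty: treat as perfect agreement
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical annotations of one organ on a tiny 2D slice.
annotations = {
    "dr1": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "dr2": {(0, 1), (1, 0), (1, 1), (1, 2)},
    "ai": {(0, 0), (0, 1), (1, 1), (1, 2)},
}

# Pairwise agreement between every two annotators.
pairwise = {
    (x, y): dice(annotations[x], annotations[y])
    for x, y in combinations(sorted(annotations), 2)
}
```

If the AI-versus-doctor scores fall in the same range as the doctor-versus-doctor score, the pseudo labels agree with each expert about as well as the experts agree with each other, which is the kind of claim the caption makes.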
