Diff4MMLiTS: Advanced Multimodal Liver Tumor Segmentation via Diffusion-Based Image Synthesis and Alignment

Shiyun Chen; Li Lin; Pujin Cheng; ZhiCheng Jin; JianJian Chen; HaiDong Zhu; Kenneth K. Y. Wong; Xiaoying Tang

TL;DR

Diff4MMLiTS addresses the challenge of unregistered multimodal liver CT data with a four-stage pipeline: it pre-registers the target organs across modalities, generates tumor-free normal CTs via inpainting, synthesizes strictly aligned multimodal CTs with tumors using a latent diffusion model, and trains a multimodal segmenter on a hybrid real-synthetic dataset. The approach removes the reliance on strictly aligned data, outperforms other multimodal segmentation methods on mmLiTs, and generalizes to LiTS, with gains across segmentation backbones and in data efficiency. Key contributions include the Normal CT Generator, the latent-diffusion-based Multimodal CT Synthesizer, and a hybrid training regimen that uses synthetic data to improve segmentation accuracy. In practice, this enables liver tumor segmentation in clinical settings where multimodal alignment is imperfect, which could improve diagnostic precision and treatment planning.

Abstract

Multimodal learning has been demonstrated to enhance performance across various clinical tasks, owing to the diverse perspectives offered by different modalities of data. However, existing multimodal segmentation methods rely on well-registered multimodal data, which is unrealistic for real-world clinical images, particularly for indistinct and diffuse regions such as liver tumors. In this paper, we introduce Diff4MMLiTS, a four-stage multimodal liver tumor segmentation pipeline: pre-registration of the target organs in multimodal CTs; dilation of the annotated modality's mask, followed by its use in inpainting to obtain tumor-free multimodal normal CTs; synthesis of strictly aligned multimodal CTs with tumors using a latent diffusion model conditioned on multimodal CT features and randomly generated tumor masks; and finally, training of the segmentation model on the resulting data, which eliminates the need for strictly aligned multimodal data. Extensive experiments on public and internal datasets demonstrate the superiority of Diff4MMLiTS over other state-of-the-art multimodal segmentation methods.
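The two mask operations named in the abstract, dilating the annotated tumor mask before inpainting and drawing a random tumor mask for synthesis, can be illustrated with a minimal NumPy sketch. Everything below is an assumption for illustration: the abstract does not specify the dilation kernel, the number of dilation steps, or the shape model for random masks, so the 6-connected kernel, the ellipsoid shape, and the function names `dilate_mask` and `random_ellipsoid_mask` are hypothetical and not the authors' implementation.

```python
import numpy as np


def dilate_mask(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Binary dilation with a face-connected (6-neighbour in 3D) kernel.

    Assumed kernel and iteration count; the paper's choices are not given here.
    """
    out = mask.astype(bool)
    crop = tuple(slice(1, -1) for _ in range(out.ndim))
    for _ in range(iterations):
        # Pad with background so np.roll only wraps background voxels.
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = padded.copy()
        for axis in range(out.ndim):
            grown |= np.roll(padded, 1, axis=axis)
            grown |= np.roll(padded, -1, axis=axis)
        out = grown[crop]
    return out


def random_ellipsoid_mask(shape, rng, min_radius=2, max_radius=5):
    """Draw one random ellipsoid as a stand-in for a synthetic tumor mask.

    Hypothetical shape model; a real pipeline would also restrict the
    centre to voxels inside the liver.
    """
    radii = rng.integers(min_radius, max_radius + 1, size=len(shape))
    # Keep the whole ellipsoid inside the volume (requires size > 2 * radius).
    center = [rng.integers(r, s - r) for r, s in zip(radii, shape)]
    grids = np.ogrid[tuple(slice(0, s) for s in shape)]
    dist = sum(((g - c) / r) ** 2 for g, c, r in zip(grids, center, radii))
    return dist <= 1.0


# Toy usage: one annotated voxel, dilated once, marks the region to inpaint.
annotated = np.zeros((5, 5, 5), dtype=bool)
annotated[2, 2, 2] = True
inpaint_region = dilate_mask(annotated, iterations=1)

rng = np.random.default_rng(0)
synthetic_tumor = random_ellipsoid_mask((32, 32, 32), rng)
```

Dilating before inpainting gives a margin around the annotation, which matters when the same mask is applied to a second modality that is only roughly registered to the annotated one.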

Paper Structure (14 sections, 4 equations, 2 figures, 5 tables)

Introduction
Related Work
Method
Normal CT Generator (NCG) Module
Multimodal CT Synthesizer (MCS) Module
Multimodal Segmentation (MS) Module
Experiments and Results
Datasets
Implementation Details
Overall Performance
Generalizability and Adaptability of Diff4MMLiTS
Effectiveness of Multimodal Synthesis Strategy
Contribution of Different Modules to the Framework
Conclusion

Figures (2)

Figure 1: The architecture of Diff4MMLiTS. The Normal CT Generator (NCG) module uses the extended PVP mask to inpaint the multimodal images and obtain normal CTs. The Multimodal CT Synthesizer (MCS) module uses the normal CTs to synthesize multimodal CTs. The Multimodal Segmenter (MS) module trains the segmenter on real and synthetic data. MCS comprises two components: a 3D autoencoder consisting of a CT feature encoder and decoder, and a tumor synthesizer based on a diffusion model (DM).
Figure 2: Qualitative visualization results.
