Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging

Siyuan Dai, Kai Ye, Guodong Liu, Haoteng Tang, Liang Zhan

TL;DR

This work tackles the challenge of multimodal medical image segmentation without requiring paired vision-language data by introducing Zeus, a framework that generates zero-shot text instructions from multimodal medical images using frozen LVLMs and LLMs. The generated instructions guide a SAM-based mask decoder to perform union segmentation across multiple modalities, emulating the diagnostic reasoning of a physician. Evaluated on MSD-Prostate, MSD-Brain, and CHAOS, Zeus achieves state-of-the-art DSC and mIoU across early, hybrid, and late fusion schemes, while maintaining a lightweight, end-to-end pipeline with frozen encoders. The study demonstrates that cross-modal knowledge and instruction prompting can be leveraged to improve clinical segmentation tasks without costly data collection or extensive model fine-tuning, paving the way for practical deployment in multimodal medical imaging.
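The two metrics named above, DSC (Dice similarity coefficient) and mIoU (mean intersection over union), can be illustrated with a minimal sketch in plain Python. The function names `dice_and_iou` and `mean_iou` are hypothetical. Two things are assumptions rather than details from the paper: a class that is absent from both maps scores 1.0, and mIoU is an unweighted mean over the listed class labels. The paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Tuple

# A label map is a 2D grid of integer class ids (0 is treated as background here).
Mask = Sequence[Sequence[int]]


def dice_and_iou(pred: Mask, target: Mask, label: int = 1) -> Tuple[float, float]:
    """Return (DSC, IoU) for one class label over two equally sized label maps."""
    intersection = 0
    pred_count = 0
    target_count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred = p == label
            in_target = t == label
            pred_count += int(in_pred)
            target_count += int(in_target)
            intersection += int(in_pred and in_target)

    union = pred_count + target_count - intersection
    if union == 0:
        # Assumption: a class absent from both maps counts as a perfect match.
        return 1.0, 1.0

    dice = 2 * intersection / (pred_count + target_count)
    iou = intersection / union
    return dice, iou


def mean_iou(pred: Mask, target: Mask, labels: Sequence[int]) -> float:
    """Unweighted mean of per-class IoU over the given labels (an assumed convention)."""
    return sum(dice_and_iou(pred, target, label)[1] for label in labels) / len(labels)


if __name__ == "__main__":
    pred = [[1, 1, 0],
            [0, 0, 0]]
    target = [[1, 0, 0],
              [1, 0, 0]]
    dsc, iou = dice_and_iou(pred, target, label=1)
    print(f"DSC={dsc:.3f} IoU={iou:.3f} mIoU={mean_iou(pred, target, [0, 1]):.3f}")
```

For a single class the two scores are monotonically related (DSC = 2·IoU / (1 + IoU)), so they rank predictions identically and differ only in scale.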

Abstract

Medical image segmentation has achieved remarkable success through the continuous advancement of UNet-based and Transformer-based foundation backbones. However, real-world clinical diagnosis often requires integrating domain knowledge, especially textual information. Multimodal learning over visual and text modalities has been shown to be a solution, but collecting paired vision-language datasets is expensive and time-consuming. Inspired by the strong ability of Large Language Models (LLMs) in numerous cross-modal tasks, we propose a novel Vision-LLM union framework to address these issues. Specifically, we introduce frozen LLMs for zero-shot instruction generation based on the corresponding medical images, imitating the radiology scanning and report generation process. To better approximate real-world diagnostic processes, we draw on the semantic understanding and rich knowledge of LLMs to generate more precise text instructions from multimodal radiology images (e.g., T1-w or T2-w MRI and CT). This process emphasizes extracting modality-specific features from the different modalities and reuniting the information for the final clinical diagnosis. With the generated text instructions, our proposed union segmentation framework can handle multimodal segmentation without previously collected vision-language datasets. To evaluate our proposed method, we conduct comprehensive experiments against influential baselines; the statistical results and the visualized case study demonstrate the superiority of our method.

Paper Structure

This paper contains 18 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The architecture of Zeus. It consists of a pre-trained vision-language model and a large language model with a pre-trained vision backbone as the prompt encoder. The trainable mask decoder accepts image-instruction pairs for mask prediction.
  • Figure 2: The pipeline of text instruction generation. It uses a pre-trained vision-language model that processes only the image. A simple linear layer transfers knowledge from the vision-language model to a pure large language model driven by text prompts.
  • Figure 3: Visualization of the multi-class organ segmentation results, bi-class prostate segmentation results, and bi-class brain tumor segmentation results.
  • Figure 4: Visualization of the bi-class prostate segmentation results, bi-class brain tumor segmentation results, and multi-class organ segmentation results.