SegICL: A Multimodal In-context Learning Framework for Enhanced Segmentation in Medical Imaging

Lingdong Shen; Fangxin Shang; Xiaoshuang Huang; Yehui Yang; Haifeng Huang; Shiming Xiang

TL;DR

SegICL tackles medical image segmentation under Out-of-Distribution (OOD) conditions by introducing a multimodal in-context learning framework that combines text prompts with a small set of image–mask examples, enabling training-free adaptation to unseen modalities. It integrates a large-language-model–based multimodal encoder with a diffusion-based image decoder to produce segmentation masks, guided by in-context cues and a lightweight condition encoder. Across optic fundus, CT, and MRI datasets, SegICL demonstrates that segmentation performance improves with more in-context prompts and achieves competitive results on in-distribution tasks, while offering practical advantages for cold-start annotation and cross-modal generalization. The work highlights a practical path toward cost-efficient, user-friendly, and adaptable medical image segmentation in multi-modal environments.

Abstract

In medical image segmentation, tackling Out-of-Distribution (OOD) segmentation tasks in a cost-effective manner remains a significant challenge. Universal segmentation models, which aim to generalize across diverse medical imaging modalities, are one solution, yet their effectiveness often diminishes on OOD modalities and tasks, and they require intricate fine-tuning for optimal performance. Few-shot segmentation methods are typically designed for a specific data modality and cannot be directly transferred to another. We therefore introduce SegICL, a novel approach that leverages In-Context Learning (ICL) for image segmentation. Unlike existing methods, SegICL can perform text-guided segmentation and conduct in-context learning with a small set of image-mask pairs, eliminating the need to train the model from scratch or fine-tune it for OOD tasks (including OOD modalities and datasets). Extensive experiments demonstrate a positive correlation between the number of shots and segmentation performance on OOD tasks. Segmentation performance with three shots is approximately 1.5 times better than in the zero-shot setting. This indicates that SegICL effectively addresses new segmentation tasks based on contextual information. SegICL also exhibits performance comparable to mainstream models on OOD and in-distribution tasks. Our code will be released after paper review.
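
The in-context setup described in the abstract (a text instruction plus a small set of image-mask pairs, followed by the image to segment) can be illustrated with a minimal sketch. This is not the authors' code, which is unreleased. The names `ImageRef`, `build_prompt`, and `variant_name`, the file paths, and the ordering of instruction, shots, and query are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class ImageRef:
    """Hypothetical placeholder for an image or mask (a real system holds pixels)."""
    path: str
    role: str  # "image", "mask", or "query"


Segment = Union[str, ImageRef]


def build_prompt(instruction: str,
                 shots: List[Tuple[str, str]],
                 query_image: str) -> List[Segment]:
    """Interleave a text instruction with image-mask pairs, then the query image.

    The ordering (instruction, then shots, then query) is an assumption.
    """
    prompt: List[Segment] = [instruction]
    for image_path, mask_path in shots:
        prompt.append(ImageRef(image_path, "image"))
        prompt.append(ImageRef(mask_path, "mask"))
    prompt.append(ImageRef(query_image, "query"))
    return prompt


def variant_name(shots: List[Tuple[str, str]]) -> str:
    """Name the setting by shot count, e.g. SegICL-0 for zero-shot inference."""
    return f"SegICL-{len(shots)}"


# Zero-shot: text instruction and query image only.
zero_shot = build_prompt("Segment the optic disc.", [], "query.png")

# Three-shot: three image-mask pairs are placed before the query image.
pairs = [("a.png", "a_mask.png"), ("b.png", "b_mask.png"), ("c.png", "c_mask.png")]
three_shot = build_prompt("Segment the optic disc.", pairs, "query.png")
```

Under these assumptions, moving from zero-shot to few-shot only lengthens the prompt. No model weights change, which is what makes the adaptation training-free.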

Paper Structure (17 sections, 6 equations, 9 figures, 1 table)

Introduction
Related work
Universal medical image segmentation
Learning from limited data samples
Methodology
In-Context Learning Paradigm of SegICL
Architecture
Multi-modal Encoder
Image Decoder
Experiments
Dataset
Implementation details
Performance of the OOD modality
Performance of the OOD dataset
Performance of the In-distribution datasets
...and 2 more sections

Figures (9)

Figure 1: An example of the SegICL inference pipeline. The left part shows that SegICL can follow text instructions. The right part shows that SegICL can segment OOD data given a few image-mask pairs and text instructions.
Figure 2: The overall structure of the SegICL paradigm. SegICL-0 represents zero-shot inference, while SegICL-x represents using x image-mask pairs as prompts for inference.
Figure 3: Training pipeline of SegICL. Module A is an image decoder, and Module B is a multimodal encoder. Interleaved multimodal text and image input is encoded by Module B (yielding the hidden variable H), and the projected data (the result state S) is then passed to Module A for decoding, ultimately generating the corresponding mask.
Figure 4: Diagram of the differences between different base models. Different base models possess varying degrees of prior knowledge. Thanks to pre-trained weights, the model can roughly localize the target even in a zero-shot scenario.
Figure 5: Performance comparison of SegICL on the OOD modality. A positive correlation can be observed between the number of prompt samples (SegICL-x) and segmentation performance. Although SegICL-3 does not match the SOTA models, its training-free results are still adequate for assisting cold-start in semi-automatic annotation.
...and 4 more figures
