AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models

Xingjian Li, Qifeng Wu, Adithya S. Ubaradka, Yiran Ding, Colleen Que, Runmin Jiang, Jianhua Xing, Tianyang Wang, Min Xu

TL;DR

AutoMiSeg tackles annotation-free, zero-shot medical image segmentation by composing a grounding module, a visual-prompt boosting module, and a promptable segmentation model, all guided by Learnable Test-time Adaptors optimized via Bayesian Optimization with a proxy validator. It leverages off-the-shelf foundation models for grounding and segmentation, enabling automatic task-specific prompts without training. Across seven diverse datasets, AutoMiSeg improves the Dice score from 42.53 to 71.81, a 69% relative gain, and performs competitively with weakly prompted interactive models, highlighting a scalable path toward annotation-efficient clinical image analysis. The modular design and AutoML-driven adaptation make it a promising platform for extending zero-shot segmentation to broader medical imaging domains.
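To make the reported numbers concrete, here is a minimal sketch of how a Dice score and a relative gain can be computed. The function names `dice_score` and `relative_gain`, and the representation of masks as sets of pixel coordinates, are illustrative assumptions and are not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient in percent for two binary masks.

    Masks are represented as sets of (row, col) foreground pixels
    (an illustrative choice, not the paper's data format).
    """
    if not pred and not target:
        return 100.0  # two empty masks agree perfectly
    overlap = len(pred & target)
    return 100.0 * 2 * overlap / (len(pred) + len(target))


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Toy masks: 3 pixels each, 2 of them shared.
pred = {(0, 0), (0, 1), (1, 1)}
target = {(0, 1), (1, 1), (1, 2)}
print(round(dice_score(pred, target), 2))     # 66.67

# Dice values quoted in the summary above.
print(round(relative_gain(42.53, 71.81), 1))  # 68.8, i.e. roughly 69%
```

The second call shows where the 69% figure comes from: (71.81 - 42.53) / 42.53 is about 0.688.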

Abstract

Medical image segmentation is vital for clinical diagnosis, yet current deep learning methods often demand extensive expert effort, either to annotate large training datasets or to provide prompts at inference time for each new case. This paper introduces a zero-shot and automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., "segment the optic disc in an eye fundus image"), our method uses a grounding model to generate an initial bounding box, followed by a visual prompt boosting module that enhances the prompts, which are then processed by a promptable segmentation model to produce the final mask. To address the challenges of domain gap and result verification, we introduce a test-time adaptation framework featuring a set of learnable adaptors that align the medical inputs with foundation model representations. The adaptors' hyperparameters are optimized via Bayesian Optimization, guided by a proxy validation model that does not require ground-truth labels. Our pipeline offers an annotation-efficient and scalable solution for zero-shot medical image segmentation across diverse tasks. We evaluate it on seven diverse medical imaging datasets. Through proper decomposition and test-time adaptation, our fully automatic pipeline not only substantially surpasses the previously best-performing method, yielding a 69% relative improvement in accuracy (Dice score from 42.53 to 71.81), but also performs competitively with weakly prompted interactive foundation models.
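The control flow described in the abstract (adapt the input, ground, boost prompts, segment, then score the result without labels and search over adaptor settings) can be sketched as follows. This is only a toy illustration under strong assumptions: every function here (`adapt`, `ground`, `boost`, `segment`, `validate`, `search`) is a hypothetical stand-in operating on a small intensity grid, not one of the foundation models used in the paper, and plain random search is used in place of Bayesian Optimization to keep the example dependency-free.

```python
import random

# All components are hypothetical toy stand-ins, not the paper's models.

def adapt(image, gain):
    """Toy test-time adaptor: scale intensities by a tunable gain, clipped to 1."""
    return [[min(1.0, px * gain) for px in row] for row in image]

def ground(image, thresh=0.5):
    """Toy grounding module: bounding box (r0, c0, r1, c1) around bright pixels."""
    hits = [(r, c) for r, row in enumerate(image)
            for c, px in enumerate(row) if px >= thresh]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), min(cols), max(rows), max(cols)

def boost(box):
    """Toy prompt booster: add the box center as an extra point prompt."""
    r0, c0, r1, c1 = box
    return {"box": box, "point": ((r0 + r1) // 2, (c0 + c1) // 2)}

def segment(image, prompts, thresh=0.5):
    """Toy promptable segmenter: bright pixels inside the prompted box."""
    r0, c0, r1, c1 = prompts["box"]
    return {(r, c) for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1) if image[r][c] >= thresh}

def validate(image, mask):
    """Toy label-free proxy validator: mean intensity inside the mask
    minus mean intensity outside it, measured on the original image."""
    inside = [image[r][c] for r, c in mask]
    outside = [px for r, row in enumerate(image)
               for c, px in enumerate(row) if (r, c) not in mask]
    if not inside or not outside:
        return float("-inf")
    return sum(inside) / len(inside) - sum(outside) / len(outside)

def run_pipeline(image, gain):
    """Adapt, ground, boost, segment; returns a set of foreground pixels."""
    adapted = adapt(image, gain)
    box = ground(adapted)
    if box is None:
        return set()
    return segment(adapted, boost(box))

def search(image, trials=50, seed=0):
    """Random search over the adaptor gain (stand-in for Bayesian Optimization),
    keeping the setting with the highest proxy validation score."""
    rng = random.Random(seed)
    best = (float("-inf"), None, set())
    for _ in range(trials):
        gain = rng.uniform(0.5, 4.0)
        mask = run_pipeline(image, gain)
        score = validate(image, mask)
        if score > best[0]:
            best = (score, gain, mask)
    return best

# A dim 4x4 image: a 2x2 "lesion" at 0.3, background at 0.1, one distractor at 0.15.
image = [[0.1, 0.1, 0.1, 0.15],
         [0.1, 0.3, 0.3, 0.1],
         [0.1, 0.3, 0.3, 0.1],
         [0.1, 0.1, 0.1, 0.1]]
lesion = {(1, 1), (1, 2), (2, 1), (2, 2)}

score, gain, mask = search(image)
print(run_pipeline(image, 1.0) == set())  # unadapted input: nothing is found
print(mask == lesion)                     # searched gain recovers the lesion
```

In this toy setting, too small a gain leaves the target undetected and too large a gain pulls in the distractor pixel, so the proxy score alone is enough to pick a good setting. In practice a Bayesian optimizer would replace the random sampling in `search`, proposing each new setting from the scores observed so far.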

Paper Structure

This paper contains 43 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Conceptual illustration of our proposed annotation-free and non-interactive paradigm for medical image segmentation (AutoMiSeg), compared with traditional supervised and interactive segmentation, which require considerable effort from medical experts.
  • Figure 2: Overview of the proposed AutoMiSeg pipeline. Given a medical image and task description, a grounding module predicts a coarse region, refined by a prompt booster. A promptable segmentation model then generates the binary mask. A validator checks consistency with the task, while Learnable Test-time Adaptors (LTAs) adapt inputs and are optimized via Bayesian Optimization.
  • Figure 3: Qualitative visualization of our AutoMiSeg pipeline on two examples from the Kvasir (Pogorelov et al., 2017) and Busi (Al-Dhabyani et al., 2024) datasets. The columns from left to right represent: (1) input image, (2) ground truth, (3) input image of the grounding model and its output, (4) input of the segmentation model and its output, and (5) input of the validator.
  • Figure 4: The left two panels show the correlation between the normalized validation scores and the true scores on the Kvasir (Pogorelov et al., 2017) and Busi (Al-Dhabyani et al., 2024) datasets with the searched hyperparameters. The right two panels present the Bayesian Optimization process.
  • Figure 5: An example from the REFUGE (Orlando et al., 2020) dataset showing how the segmentation quality gradually improves during the optimization process.
  • ...and 2 more figures