Tuning Vision Foundation Model via Test-Time Prompt-Guided Training for VFSS Segmentations

Chengxi Zeng, David Smithard, Alberto M Gambaruto, Tilo Burghardt

TL;DR

The paper addresses adapting vision foundation models to VFSS segmentation without full annotations by introducing Prompt-TTT, a test-time training paradigm that uses point prompts as an auxiliary self-supervised signal. The method employs a two-branch architecture: a shared encoder feeds one decoder for the box-prompted segmentation task and another for the point-prompted auxiliary task, and a consistency loss ties together predictions across augmentations and prompts. On the VFSS-5k dataset covering 12 anatomies, the approach achieves an average Dice similarity coefficient (DSC) of $0.868$, narrowing the gap to specialist models and outperforming prior TTT variants. This prompt-driven, temporally aware framework offers an annotation-efficient path to robust medical video segmentation with practical clinical impact.
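To make the idea of a consistency loss across augmentations and prompts more concrete, here is a minimal sketch. It is not the paper's formulation, which this summary does not specify. It assumes that each view (one augmentation paired with one point prompt) yields a soft foreground mask that has already been mapped back to the original image frame, and it penalises the mean pairwise squared difference between those masks. The function name `consistency_loss` and the flat-list mask representation are hypothetical choices made for illustration.

```python
from itertools import combinations


def consistency_loss(predictions):
    """Mean pairwise squared difference between soft masks.

    predictions: list of flat lists of foreground probabilities in [0, 1],
    one list per (augmentation, point prompt) view. All views are assumed
    to be aligned to the same image frame already.

    Hypothetical formulation for illustration, not the paper's exact loss.
    """
    if len(predictions) < 2:
        raise ValueError("need at least two views to measure consistency")
    n_pixels = len(predictions[0])
    if n_pixels == 0 or any(len(p) != n_pixels for p in predictions):
        raise ValueError("all views must be non-empty and the same size")

    pair_losses = []
    for view_a, view_b in combinations(predictions, 2):
        # Per-pixel squared disagreement, averaged over the mask.
        sq_diff = sum((a - b) ** 2 for a, b in zip(view_a, view_b))
        pair_losses.append(sq_diff / n_pixels)

    # Average over all unordered pairs of views.
    return sum(pair_losses) / len(pair_losses)


# Three views of a 4-pixel image: two agree, one is less certain.
views = [
    [0.9, 0.8, 0.1, 0.0],
    [0.9, 0.8, 0.1, 0.0],
    [0.6, 0.5, 0.4, 0.3],
]
loss = consistency_loss(views)
```

The loss is zero when all views agree and grows as different point prompts or augmentations produce conflicting masks. This is the sense in which minimising it would push the model to resolve prompt ambiguity without ground-truth masks.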

Abstract

Vision foundation models have demonstrated exceptional generalization capabilities in segmentation tasks for both generic and specialized images. However, a performance gap persists between foundation models and task-specific, specialized models. Fine-tuning foundation models on downstream datasets is often necessary to bridge this gap. Unfortunately, obtaining fully annotated ground truth for downstream datasets is both challenging and costly. To address this limitation, we propose a novel test-time training paradigm that enhances the performance of foundation models on downstream datasets without requiring full annotations. Specifically, our method employs simple point prompts to guide a test-time semi-self-supervised training task. The model learns by resolving the ambiguity of the point prompt through various augmentations. This approach directly tackles challenges in the medical imaging field, where acquiring annotations is both time-intensive and expensive. We conducted extensive experiments on our new Videofluoroscopy dataset (VFSS-5k) for the instance segmentation task, achieving an average Dice coefficient of 0.868 across 12 anatomies with a single model.
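For reference, the Dice coefficient reported above measures the overlap between a predicted mask $A$ and a ground-truth mask $B$ as $DSC = 2|A \cap B| / (|A| + |B|)$. The sketch below computes it for binary masks and averages it over classes. The helper names `dice_coefficient` and `mean_dice`, the nested-list mask format, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration. The paper's exact averaging protocol over the 12 anatomies is not stated in this summary.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks.

    pred, target: 2D lists (rows of 0/1 values) of identical shape.
    Returns 1.0 when both masks are empty (an assumed convention).
    """
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


def mean_dice(per_class_masks):
    """Average Dice over classes.

    per_class_masks: list of (pred, target) pairs, one per anatomy.
    """
    scores = [dice_coefficient(p, t) for p, t in per_class_masks]
    return sum(scores) / len(scores)


# Toy example: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
```

In the toy example the masks share two of their three foreground pixels each, giving a score of two thirds. A perfect prediction scores 1 and disjoint masks score 0.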

Paper Structure

This paper contains 16 sections, 8 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Schematic diagram. (a) Training time workflow – the image is processed in two tasks: the main task (box-prompted) and the auxiliary task (point-prompted). (b) Our TTT strategy leverages the resolution of ambiguity generated from different point prompts, in comparison to other TTT methods such as (c) Rotation Prediction and (d) Masked Image Reconstruction.
  • Figure 2: Quantitative Results. Comparing specialist models with the foundation model MedSAM and with our TTT strategies in three consecutive frames.