Towards Foundation Models and Few-Shot Parameter-Efficient Fine-Tuning for Volumetric Organ Segmentation

Julio Silva-Rodríguez; Jose Dolz; Ismail Ben Ayed

TL;DR

This work defines Few-Shot Efficient Fine-Tuning (FSEFT) for volumetric CT organ segmentation, addressing practical clinical constraints by combining supervised foundation-model pre-training with data- and parameter-efficient adaptation. It introduces privacy-preserving black-box Adapters and Spatial Adapters, plus transductive inference using anatomical priors, and demonstrates that a foundation model trained on 2,042 scans across 9 public datasets can transfer effectively in few-shot regimes, often outperforming full fine-tuning and standard PEFT in base tasks. Results on TotalSegmentator and FLARE'22 show strong few-shot performance gains, substantial efficiency benefits, and improved generalization to novel organs when combined with decoder tuning or proper head initialization. The study underscores the value of supervised pre-training for medical segmentation, highlights practical privacy-preserving adaptation options, and maps clear avenues for deploying foundation models in real-world clinical settings with limited data and computing resources.

Abstract

Foundation models and the pre-train-and-adapt paradigm, in which a large-scale model is transferred to downstream tasks, are gaining attention for volumetric medical image segmentation. However, current transfer learning strategies rely on full fine-tuning, which may require significant resources and yield sub-optimal results when labeled data for the target task is scarce. This limits their applicability in real clinical settings, since clinical institutions are usually constrained in both data and computational resources when developing proprietary solutions. To address this challenge, we formalize Few-Shot Efficient Fine-Tuning (FSEFT), a novel and realistic scenario for adapting medical image segmentation foundation models. This setting considers the key role of both data and parameter efficiency during adaptation. Building on a foundation model pre-trained on open-access CT organ segmentation sources, we propose leveraging Parameter-Efficient Fine-Tuning and black-box Adapters to address such challenges. Furthermore, we introduce novel efficient adaptation methodologies: Spatial black-box Adapters, which are more appropriate for dense prediction tasks, and constrained transductive inference, which leverages task-specific prior knowledge. Our comprehensive transfer learning experiments confirm the suitability of foundation models in medical image segmentation and unveil the limitations of popular fine-tuning strategies in few-shot scenarios.

Paper Structure (38 sections, 4 equations, 8 figures, 10 tables)

Introduction
Related Work
Pre-training transferable medical volumetric models
Fine-tuning to downstream tasks
Towards a more efficient adaptation of foundation models
Parameter-Efficient Fine-Tuning
Black-box adaptation
Proposed Setting
Foundation model training
Few-Shot Efficient Fine-Tuning
Towards efficient adaptation using few-labeled volumes
Parameter-Efficient Fine-Tuning
Black-box adaptation
Leveraging anatomical priors during adaptation
Experiments
...and 23 more sections

Figures (8)

Figure 1: Towards efficiently adapting medical volumetric foundation models. In this work, we introduce a foundation model for volumetric CT organ segmentation, pre-trained on 2,042 partially annotated scans. We then delve into clinically realistic settings to adapt these models, considering (a) data and (b) parameter efficiency. These motivations formalize our proposed transfer learning setting, Few-Shot Efficient Fine-Tuning (FSEFT), which leverages popular Parameter-Efficient Fine-Tuning methods and novel privacy-preserving black-box Adapters to address such real-world challenges. DSC: Dice similarity coefficient.
Figure 2: Few-Shot Efficient Fine-Tuning (FSEFT). A foundation model for volumetric medical images has been developed (see top, and Section \ref{sec:pretraining}). This model is pre-trained on a collection of CT volumes from nine open-access datasets, consisting of 2,042 scans with 29 partially labeled structures. Following Eq. \ref{eq:foundation_model}, the network is trained using a supervised learning objective, targeting organ segmentation. Then, a novel and realistic scenario is proposed, considering the resource limitations of clinical institutions for developing robust segmentation models. Concretely, this scenario assumes that: (i) adaptation should be performed by accessing only a few labeled volumes (see bottom left and Section \ref{sec:fewshot}), and (ii) the specialization of the foundation model to downstream tasks should require the least computational resources. For the latter, Parameter-Efficient Fine-Tuning (PEFT) techniques such as LoRA, BitFit, or AdaptFormer (see bottom center and Section \ref{sec:peft}) and black-box Adapters such as linear probing and a newly proposed Spatial Adapter (see bottom right and Section \ref{sec:blackbox}), are considered.
Figure 3: Leveraging anatomical priors. We propose integrating priors on the target structure size while adapting the pre-trained foundation model in a transductive fashion. Concretely, a black-box Adapter is trained to minimize a given segmentation loss on the few support annotated volumes, and an inequality constraint is applied during training to produce consistent organ sizes (see Eq. \ref{eq:transductive_inference}).
Figure 4: Qualitative evaluation of the transductive inference. Axial views of CT scans from the TotalSegmentator dataset. The annotation/prediction masks of the target organ are in red. The effect of black-box adaptation with K=5 shots, with and without transductive inference (TI) leveraging anatomical priors as described in Section \ref{subsec:method_ti}, is illustrated.
Figure 5: Few-shot adaptation on FLARE'22. The capabilities of PEFT and black-box Adapters for transferring the proposed supervised pre-trained foundation model are illustrated. Results using K=10 annotated support volumes for adaptation to the segmentation of nine base organs with domain shifts.
...and 3 more figures
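
The size prior described in the Figure 3 caption can be illustrated with a small sketch. The code below is not the paper's formulation: the exact constraint is given by the transductive inference equation, which is not reproduced on this page. It assumes, purely for illustration, that the prior is a pair of bounds on the fraction of voxels occupied by the organ, estimated from the support masks with a relative margin, and that violations are penalized quadratically. The names `size_bounds`, `size_penalty`, and `margin` are hypothetical.

```python
from typing import Sequence, Tuple


def size_bounds(
    support_masks: Sequence[Sequence[int]], margin: float = 0.1
) -> Tuple[float, float]:
    """Estimate bounds on the organ's relative size from binary support masks.

    Assumption: the prior is the [min, max] organ proportion seen in the
    support set, widened by a relative margin.
    """
    proportions = [sum(mask) / len(mask) for mask in support_masks]
    return (1.0 - margin) * min(proportions), (1.0 + margin) * max(proportions)


def size_penalty(probs: Sequence[float], lower: float, upper: float) -> float:
    """Quadratic penalty that is zero when the soft size lies in [lower, upper]."""
    size = sum(probs) / len(probs)  # soft predicted organ proportion
    if size < lower:
        return (lower - size) ** 2
    if size > upper:
        return (size - upper) ** 2
    return 0.0


# Toy example: flattened "volumes" of 10 voxels each.
support = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # organ covers 20% of the voxels
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # organ covers 30% of the voxels
]
lower, upper = size_bounds(support, margin=0.1)

over_segmented = [0.9] * 8 + [0.1] * 2  # soft size well above the upper bound
plausible = [0.9] * 2 + [0.05] * 8      # soft size inside the bounds

print(size_penalty(over_segmented, lower, upper))  # positive penalty
print(size_penalty(plausible, lower, upper))       # no penalty
```

In a training loop, a term of this kind would be computed on the unlabeled query volume and added to the segmentation loss on the support volumes, so that only predictions with implausible organ sizes are pushed back toward the prior.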
