Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging

Filippo Ruffini, Elena Mulero Ayllon, Linlin Shen, Paolo Soda, Valerio Guarrasi

TL;DR

This paper presents the first large-scale benchmark comparing CNNs and Foundation Models (FMs) for prognosis prediction in medical imaging, specifically COVID-19 chest X-ray outcomes, under data scarcity and class imbalance. It evaluates full fine-tuning, linear probing, and a range of parameter-efficient fine-tuning (PEFT) methods (LoRA, VeRA, BitFit, IA3) across diverse pretrained models, datasets, and few-shot scenarios. Key findings show that CNNs with full fine-tuning perform robustly on small, imbalanced data, while FMs with PEFT are competitive on larger datasets but highly sensitive to imbalance; in few-shot settings, linear probing often yields the most stable results. The study provides practical guidance on when to deploy CNNs versus FMs and which fine-tuning strategies offer favorable efficiency–performance trade-offs in real-world clinical contexts.

Abstract

Despite the significant potential of Foundation Models (FMs) in medical imaging, their application to prognosis prediction remains challenging due to data scarcity, class imbalance, and task complexity, which limit their clinical adoption. This study introduces the first structured benchmark to assess the robustness and efficiency of transfer learning strategies for FMs compared with convolutional neural networks (CNNs) in predicting COVID-19 patient outcomes from chest X-rays. The goal is to systematically compare fine-tuning strategies, both classical and parameter-efficient, under realistic clinical constraints related to data scarcity and class imbalance, offering empirical guidance for AI deployment in clinical workflows. Four publicly available COVID-19 chest X-ray datasets were used, covering mortality, severity, and ICU admission, with varying sample sizes and class imbalances. CNNs pretrained on ImageNet and FMs pretrained on general or biomedical datasets were adapted using full fine-tuning, linear probing, and parameter-efficient methods. Models were evaluated under full-data and few-shot regimes using the Matthews Correlation Coefficient (MCC) and Precision-Recall AUC (PR-AUC), with cross-validation and class-weighted losses. CNNs with full fine-tuning performed robustly on small, imbalanced datasets, while FMs with Parameter-Efficient Fine-Tuning (PEFT), particularly LoRA and BitFit, achieved competitive results on larger datasets. Severe class imbalance degraded PEFT performance, whereas balanced data mitigated this effect. In few-shot settings, FMs showed limited generalization, with linear probing yielding the most stable results. No single fine-tuning strategy proved universally optimal: CNNs remain dependable for low-resource scenarios, whereas FMs benefit from parameter-efficient methods when data are sufficient.
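
To make the evaluation setup more concrete, the following is a minimal sketch of the binary MCC and of one common way to derive class weights for a weighted loss. It is illustrative only: the function names are hypothetical, and the inverse-frequency weighting shown here (the same "balanced" heuristic used by scikit-learn) is an assumption, since the abstract does not specify how the class weights were computed.

```python
import math
from collections import Counter


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from 0/1 labels; returns 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def inverse_frequency_weights(labels):
    """Class weights n / (k * n_c), so rarer classes get larger weights.

    Assumption: this weighting scheme is chosen for illustration and is
    not confirmed to be the one used in the benchmark.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


# Imbalanced toy example: 9 negatives, 1 positive, and a classifier
# that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # 0.9
print(matthews_corrcoef(y_true, y_pred))   # 0.0
print(inverse_frequency_weights(y_true))
```

The toy example shows why MCC is a reasonable choice under class imbalance: a majority-class predictor reaches 90% accuracy but an MCC of zero, because it never identifies a positive case.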

Paper Structure

This paper contains 34 sections, 9 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Overview of the methodological framework. The pipeline is structured into five main stages: dataset, models, fine-tuning, training (w/ or w/o FSL) and inference/evaluation.
  • Figure 2: Distribution of performance scores across fine-tuning methods and datasets. Each box plot summarizes the mean performance of all models fine-tuned with a given technique on a specific dataset (datasets are distinguished by color). Boxes indicate the interquartile range, the central line represents the median, and whiskers extend to the minimum and maximum values.
  • Figure 3: This figure includes only the fine-tuning techniques applicable to both $CNN$ and $FM$ architecture families. The mean performance over the test-set folds for each fine-tuning method is shown by a marker colored according to the scheme: $\mathbf{LoRA_{r=4}}$, $\mathbf{LoRA_{r=8}}$, $\mathbf{LoRA_{r=16}}$; $\mathbf{BitFit}$, $\mathbf{LP}$. The FFT method is shown with its own distinct marker.
  • Figure 4: This figure displays the mean performance over all datasets, restricted to the fine-tuning techniques applicable to both $CNN$ and $FM$ architecture families. The overall mean for each fine-tuning method is shown by a marker colored according to the scheme: $\mathbf{LoRA_{r=4}}$, $\mathbf{LoRA_{r=8}}$, $\mathbf{LoRA_{r=16}}$; $\mathbf{BitFit}$, $\mathbf{LP}$. The FFT method is shown with its own distinct marker.
  • Figure 5: This figure presents one subplot for each dataset, with the MCC results on the $y$-axis and the percentage of each model's parameters that are trained, relative to the total, on the $x$-axis. Fine-tuning techniques and models use a double encoding: color identifies the technique ($\mathbf{LoRA_{r=4}}$, $\mathbf{LoRA_{r=8}}$, $\mathbf{LoRA_{r=16}}$, $\mathbf{VeRA_{r=4}}$, $\mathbf{VeRA_{r=8}}$, $\mathbf{VeRA_{r=16}}$, $\mathbf{BitFit}$, $\mathbf{IA^3}$, $\mathbf{LP}$, and $\mathbf{FFT}$), and a second encoding identifies the model: ($CLIP\text{-}Large$), (MedCLIP$_{v}$), ($BiomedCLIP$), (DINOv2$_b$), (DINOv2$_s$), (DINOv2$_l$), (MedCLIP$_{c}$), and (PubMedCLIP).
  • ...and 7 more figures
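
The quantity on the $x$-axis of Figure 5, the share of parameters that are trained, can be illustrated with a rough sketch for a single linear layer. Everything here is an assumption made for illustration: the function names are hypothetical, the 768-dimensional layer is an arbitrary example size, and the paper's actual counts depend on which layers each method is applied to.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Fraction of trainable parameters when one linear layer gets a LoRA adapter.

    The frozen layer has a d_out x d_in weight and a d_out bias. LoRA adds
    two low-rank factors, A (rank x d_in) and B (d_out x rank), which are
    the only trainable parameters.
    """
    frozen = d_in * d_out + d_out
    trainable = rank * (d_in + d_out)
    return trainable / (frozen + trainable)


def bitfit_trainable_fraction(d_in, d_out):
    """Fraction of trainable parameters when only the bias is updated (BitFit)."""
    total = d_in * d_out + d_out
    return d_out / total


# Hypothetical 768 x 768 layer, chosen only as an example size.
for r in (4, 8, 16):
    print(f"LoRA r={r}: {100 * lora_trainable_fraction(768, 768, r):.2f}% trainable")
print(f"BitFit: {100 * bitfit_trainable_fraction(768, 768):.2f}% trainable")
```

Under these assumptions the trainable share grows with the LoRA rank, and BitFit trains a smaller share than any of the LoRA settings shown, which is the kind of efficiency axis the figure plots against MCC.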