Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation

Caroline Magg; Maaike A. ter Wee; Johannes G. G. Dobbe; Geert J. Streekstra; Leendert Blankevoort; Clara I. Sánchez; Hoel Kervadec

TL;DR

We conclude that selecting the optimal FM for a human-driven setting remains challenging: even high-performing FMs are sensitive to variations in human input prompts.

Abstract

Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The growing number of models, together with evaluations that differ in datasets, metrics, and compared models, makes direct performance comparison difficult and complicates the selection of the most suitable model for a specific clinical task. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and a public dataset, focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) Segmentation performance varies considerably between FMs and prompting strategies; 2) The Pareto-optimal models are SAM and SAM2.1 in 2D, and nnInteractive and Med-SAM2 in 3D; 3) Localization accuracy and rater consistency vary with anatomical structure, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) Segmentation performance drops with human prompts, suggesting that performance reported on "ideal" prompts extracted from reference labels might overestimate performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, this robustness did not carry over to inter-rater settings. We conclude that selecting the optimal FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: https://github.com/CarolineMagg/segmentation-FM-benchmark/
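To make the notion of Pareto-optimal models concrete, the following is a minimal sketch of how a Pareto front over two metrics could be computed, assuming DSC (higher is better) and HD95 in mm (lower is better) as the two axes. The `Result` type, the function names, and all numbers are hypothetical illustrations and are not taken from the paper or its code base.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str    # hypothetical label for a model/prompt combination
    dsc: float   # Dice similarity coefficient, higher is better
    hd95: float  # 95th-percentile Hausdorff distance in mm, lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.dsc >= b.dsc and a.hd95 <= b.hd95
    strictly_better = a.dsc > b.dsc or a.hd95 < b.hd95
    return no_worse and strictly_better


def pareto_front(results: list[Result]) -> list[Result]:
    """Keep every result that no other result dominates (input order preserved)."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Made-up numbers, for illustration only.
results = [
    Result("model_a", 0.90, 3.0),
    Result("model_b", 0.85, 2.0),
    Result("model_c", 0.80, 4.0),
    Result("model_d", 0.90, 3.5),
]
print([r.name for r in pareto_front(results)])
```

In this toy example, `model_c` and `model_d` are both dominated by `model_a`, while `model_a` and `model_b` trade off DSC against HD95 and therefore both remain on the front.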

Paper Structure (68 sections, 2 equations, 17 figures, 16 tables)

Introduction
Challenge 1 -- Scalability of multi-rater evaluations
Challenge 2 -- Benchmarking fairness and data contamination
Challenge 3 -- Disconnect between benchmark dataset diversity and clinical requirements
Related Work
Segment Anything Model
Medical Foundation Models
Independent evaluation studies for Promptable Foundation Models
Methodology
Promptable Foundation Models
Prompting Strategies in 2D
Prompting Strategies in 3D
Dataset
Observer Study
Evaluation design
...and 53 more sections

Figures (17)

Figure 1: Prompting strategies in 2D consist of prompt primitives, i.e., bounding box (a) and/or center point (b), and component selections, i.e., including prompts from either the largest component (blue prompt) or all components (white and blue prompts). The 3D prompting strategies extend this concept with slice selection (c).
Figure 2: Dataset Overview: Our dataset consists of four subsets, i.e., Wrist, Lower Leg, Shoulder, and Hip (Wasserthal et al., 2023). A subset of 404 axial slices was extracted based on constraints ensuring data coverage, diversity and comparability.
Figure 3: Spatial distribution of prompt placement per annotator per dataset. Each point corresponds to one annotator and represents the mean deviation (mm) in the x- ($\Delta x$) and y-directions ($\Delta y$) of the center point (i.e., placed by the human (a) or extracted from the bounding box (b)), relative to the reference prompt at the origin (0,0). The same-colored (more transparent) ellipses (a) and rectangles (b) represent each annotator's intra-rater consistency ((a): ($\Delta x$, $\Delta y$), (b): ($\Delta w$, $\Delta h$)). Wrist shows the smallest localization errors and the highest consistency, while Lower Leg and Hip show high localization errors and low intra- and inter-rater consistency.
Figure 4: Accumulated annotation time per annotator for all projects.
Figure 5: DSC vs. HD95 (mm) performance of all models (color-encoded) and three prompt types (symbol-encoded) for perfect prompts. Larger symbols highlight the smallest Pareto-optimal models.
...and 12 more figures
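
The 2D prompting strategies described in Figure 1 (bounding box and/or center point, taken from either the largest component or all components of a reference label) can be sketched as below. This is an illustrative, dependency-free sketch and not the authors' implementation: the function names, the 4-connectivity, the inclusive `(x_min, y_min, x_max, y_max)` box convention, and the use of the pixel centroid as the "center point" are all assumptions.

```python
from collections import deque


def connected_components(mask):
    """4-connected foreground components of a 2D 0/1 grid, as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components


def prompts_for(pixels):
    """Bounding box and center point for one component (assumed conventions)."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    box = (min(xs), min(ys), max(xs), max(ys))      # inclusive pixel indices
    point = (sum(xs) / len(xs), sum(ys) / len(ys))  # centroid as (x, y)
    return {"box": box, "point": point}


def extract_prompts(mask, selection="largest"):
    """Return prompts for the largest component only, or for all components."""
    comps = connected_components(mask)
    if not comps:
        return []
    if selection == "largest":
        comps = [max(comps, key=len)]
    return [prompts_for(p) for p in comps]


# Toy reference label with two components (sizes 4 and 2).
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(extract_prompts(mask, "largest"))
print(extract_prompts(mask, "all"))
```

Note that a centroid can fall outside a non-convex structure, so a real pipeline may need a different definition of the center point; the sketch only illustrates the primitive and component-selection choices.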
