FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation

Yuyue Zhou, Jessica Knight, Shrimanti Ghosh, Banafshe Felfeliyan, Jacob L. Jaremko, Abhilash R. Hareendranathan

TL;DR

This work addresses the challenge of segmenting elbow and wrist bones in pediatric musculoskeletal ultrasound with limited labeled data. It introduces FlexICL, a flexible visual in-context learning framework built on a ViT-Base encoder and a lightweight decoder, coupled with a suite of image-concatenation augmentations and masking strategies within a SimMIM-based self-supervised learning paradigm. Through extensive intra-video evaluation across four US datasets, FlexICL achieves robust, high-accuracy bone segmentation using only about 5% of frames as labeled data, and consistently outperforms state-of-the-art visual ICL methods (Painter, MAE-VQGAN) and conventional segmentation models (U-Net, TransUNet). The approach offers a scalable, low-label solution for real-time US interpretation in pediatric musculoskeletal trauma, with strong potential for broader clinical adoption and future high-resolution extensions.

Abstract

Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel-wise expert annotations for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models such as Painter and MAE-VQGAN, as well as conventional segmentation models such as U-Net and TransUNet, by 1-27% in Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation, well suited to medical imaging use cases where labeled data is scarce.
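The Dice coefficient used for the comparison above measures the overlap between a predicted mask and a reference mask. The following is a minimal sketch of the standard definition, not the authors' evaluation code; the function name `dice_coefficient`, the nested-list mask format, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks (nested lists)."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    # Count pixels that are foreground in both masks.
    intersection = sum(p and t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    reference = [[1, 0], [0, 0]]
    print(dice_coefficient(predicted, reference))
```

In the toy example, one pixel overlaps while the two masks contain three foreground pixels in total, so the score is 2 × 1 / 3.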

Paper Structure

This paper contains 37 sections, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: (a) Elbow US image showing humerus, region with effusion (inside the red box), a fracture (inside the yellow box), trochlea or capitellum, and ulna or olecranon. (b) Wrist US image showing bony regions like metaphysis, epiphysis, and a fracture (inside the yellow box).
  • Figure 2: The proposed low-annotation intra-video segmentation framework for musculoskeletal US. Sparse manual annotations are required (shown in green), and the model automatically generates segmentations for the remaining frames (shown in blue).
  • Figure 3: Dataset splitting and preprocessing. (a) Visual ICL model. 5% of frames were randomly selected from each video as the training set and divided into support and query pools. Each support image–mask pair was randomly matched with a query image–mask pair for training. (b) Conventional segmentation model. Training and validation sets were constructed using the same sampling strategy as the visual ICL model, randomly selecting 5% of frames from each video. A fixed random seed was used to ensure consistent sampling of the training (and validation) sets across different models. Note that the training, validation, and test sets share no images in either the visual ICL or the conventional fully supervised data split.
  • Figure 4: Different training strategies for FlexICL. (a) Pairwise augmentation: Illustrated with a 2× augmentation example, where support and query images are duplicated in their respective pools. (b) Image-wise augmentation: The original image undergoes a random crop or horizontal flip before concatenation. (c) Softmask and hardmask random masking: Demonstrating the application of random masking techniques.
  • Figure 5: FlexICL Model. We applied epoch-wise augmentation, reshuffling support-query pairings in each training epoch.
  • ...and 1 more figures
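
The frame sampling in the Figure 3 caption and the epoch-wise re-pairing in the Figure 5 caption can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function names, the even split of labeled frames into support and query pools, the minimum of two labeled frames per video, and the way the seed is combined with the epoch number are all hypothetical choices.

```python
import random


def sample_labeled_frames(num_frames, fraction=0.05, seed=0):
    """Randomly pick a fraction of frame indices from one video."""
    rng = random.Random(seed)  # fixed seed so every model sees the same sample
    # Assumption: keep at least two frames so both pools are non-empty.
    k = min(num_frames, max(2, round(num_frames * fraction)))
    return sorted(rng.sample(range(num_frames), k))


def split_support_query(labeled, seed=0):
    """Divide labeled frames into support and query pools (assumed 50/50)."""
    rng = random.Random(seed)
    shuffled = list(labeled)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def epoch_pairs(support, query, epoch, seed=0):
    """Match each query frame with a random support frame, redrawn per epoch."""
    rng = random.Random(seed * 1_000_003 + epoch)
    return [(rng.choice(support), q) for q in query]


if __name__ == "__main__":
    labeled = sample_labeled_frames(num_frames=200)
    support, query = split_support_query(labeled)
    # Frames that were never labeled form the held-out test set.
    test_frames = sorted(set(range(200)) - set(labeled))
    for epoch in range(2):
        print(epoch, epoch_pairs(support, query, epoch))
```

Deriving the pairing seed from the epoch number keeps each epoch's pairings reproducible while still letting them change between epochs.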