FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation
Yuyue Zhou, Jessica Knight, Shrimanti Ghosh, Banafshe Felfeliyan, Jacob L. Jaremko, Abhilash R. Hareendranathan
TL;DR
This work addresses the challenge of segmenting elbow and wrist bones in pediatric musculoskeletal ultrasound with limited labeled data. It introduces FlexICL, a flexible visual in-context learning framework built on a ViT-Base encoder and a lightweight decoder, coupled with a suite of image-concatenation augmentations and masking strategies within a SimMIM-based self-supervised learning paradigm. Through extensive intra-video evaluation across four US datasets, FlexICL achieves robust, high-accuracy bone segmentation using only about 5% of frames as labeled data, and consistently outperforms state-of-the-art visual ICL methods (Painter, MAE-VQGAN) and conventional segmentation models (U-Net, TransUNet). The approach offers a scalable, low-label solution for real-time US interpretation in pediatric musculoskeletal trauma, with strong potential for broader clinical adoption and future high-resolution extensions.
Abstract
Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects, but recognizing them requires expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, the pixel-wise expert annotations needed for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames and the model segments the remaining, unseen frames. We systematically investigate image concatenation techniques and training strategies for visual ICL, and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models such as Painter and MAE-VQGAN, as well as conventional segmentation models such as U-Net and TransUNet, by 1-27% in Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation, well suited to medical imaging use cases where labeled data is scarce.
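To make two of the terms above concrete, the following is a minimal Python sketch of the Dice coefficient used as the evaluation metric and of a generic grid-style visual ICL input, in which a labeled prompt frame and its mask are concatenated with an unlabeled query frame. It is not the FlexICL implementation: the 2x2 layout is assumed purely for illustration and does not reproduce the concatenation methods proposed in this work, and the function names `dice_coefficient` and `build_icl_canvas` are hypothetical.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a convention assumed here).
        return 1.0
    return float(2.0 * intersection / denom)


def build_icl_canvas(prompt_img, prompt_mask, query_img):
    """Assumed 2x2 layout for illustration only.

    Top row:    prompt image | prompt mask
    Bottom row: query image  | blank region the model would fill in
    All inputs are 2D arrays of the same height and width.
    """
    blank = np.zeros_like(query_img)
    top = np.concatenate([prompt_img, prompt_mask.astype(query_img.dtype)], axis=1)
    bottom = np.concatenate([query_img, blank], axis=1)
    return np.concatenate([top, bottom], axis=0)
```

In the intra-video setting described above, the prompt pair would come from one of the few annotated frames of a sweep and the query from an unannotated frame of the same sweep; the model's prediction for the blank region would then be compared against the expert mask with `dice_coefficient`.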
