A Simple Framework Uniting Visual In-context Learning with Masked Image Modeling to Improve Ultrasound Segmentation

Yuyue Zhou, Banafshe Felfeliyan, Shrimanti Ghosh, Jessica Knight, Fatima Alves-Pereira, Christopher Keen, Jessica Küpper, Abhilash Rakkunedeth Hareendranathan, Jacob L. Jaremko

TL;DR

This work tackles the limited labeled data problem in ultrasound (US) bone segmentation by introducing SimICL, a simple framework that unites visual in-context learning (ICL) with masked image modeling (MIM) based on the SimMIM paradigm. It feeds concatenated support-query image pairs with random masking to a ViT encoder trained from scratch, so that segmentation is learned through a self-supervised reconstruction objective. On a wrist US dataset with limited annotations and a test set of 3,822 images, SimICL achieves a Dice coefficient (DC) of 0.96 and an IoU of 0.92, outperforming state-of-the-art segmentation and visual ICL models and demonstrating robustness on small datasets. The method reduces the need for manual labeling and could facilitate AI-assisted US image analysis in clinical practice.

Abstract

Conventional deep learning models deal with images one by one and require costly, time-consuming expert labeling in medical imaging, while domain-specific restrictions limit model generalizability. Visual in-context learning (ICL) is a new and exciting area of research in computer vision. Unlike conventional deep learning, ICL emphasizes the model's ability to quickly adapt to new tasks based on given examples. Inspired by MAE-VQGAN, we proposed a new, simple visual ICL method called SimICL, which combines the image pairing of visual ICL with masked image modeling (MIM) designed for self-supervised learning. We validated our method on bony structure segmentation in a wrist ultrasound (US) dataset with limited annotations, where the clinical objective was to segment bony structures to help with subsequent fracture detection. We used a test set containing 3822 images from 18 patients for bony region segmentation. SimICL achieved a remarkably high Dice coefficient (DC) of 0.96 and Jaccard Index (IoU) of 0.92, surpassing state-of-the-art segmentation and visual ICL models (maximum DC of 0.86 and IoU of 0.76), an improvement of up to 0.10 in DC and 0.16 in IoU. This remarkably high agreement with limited manual annotations indicates that SimICL could be used for training AI models even on small US datasets. This could dramatically decrease the human expert time required for image labeling compared to conventional approaches, and enhance the real-world use of AI assistance in US image analysis.
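
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how the Dice coefficient and the Jaccard Index (IoU) are commonly computed for a pair of binary masks. It is not taken from the paper: the function name `dice_and_iou`, the smoothing term `eps`, and the toy masks are illustrative assumptions, and the paper may aggregate scores differently (for example, averaging per image or per patient).

```python
import numpy as np


def dice_and_iou(pred, target, eps=1e-7):
    """Return (Dice, IoU) for two binary masks of identical shape.

    `eps` is a hypothetical smoothing term that avoids division by zero
    when both masks are empty; it is not specified by the paper.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()

    dice = (2.0 * intersection + eps) / (total + eps)  # 2|A∩B| / (|A| + |B|)
    iou = (intersection + eps) / (union + eps)         # |A∩B| / |A∪B|
    return float(dice), float(iou)


# Toy masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 1, 0], [0, 0, 1]])
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

For a single pair of masks the two metrics are linked by IoU = DC / (2 - DC), so a DC of 0.96 corresponds to an IoU of about 0.92, consistent with the values reported above. The identity need not hold exactly for scores averaged over many images.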

Paper Structure (11 sections, 2 figures, 3 tables)


Figures (2)

  • Figure 1: SimICL overview. (A) We constructed a new input image based on one support image/mask pair and one query image. (B) The random mask was added to the image and then the masked image was fed into the model.
  • Figure 2: Segmentation prediction on test images.
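
The two steps named in the Figure 1 caption (building one input from a support image/mask pair and a query image, then applying a random mask) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the 2x2 tile layout (borrowed from the MAE-VQGAN style of visual prompting), the patch size of 16, the mask ratio of 0.5, and the function names `build_icl_canvas` and `random_patch_mask` are all hypothetical, since this summary does not give those details.

```python
import numpy as np


def build_icl_canvas(support_img, support_mask, query_img, query_mask=None):
    """Tile support image/mask and query image into one canvas.

    Assumed layout (not confirmed by the paper):
        [ support_img | support_mask ]
        [ query_img   | query_mask   ]
    When no query mask is available, that tile is left blank (zeros).
    """
    if query_mask is None:
        query_mask = np.zeros_like(query_img)
    top = np.concatenate([support_img, support_mask], axis=1)
    bottom = np.concatenate([query_img, query_mask], axis=1)
    return np.concatenate([top, bottom], axis=0)


def random_patch_mask(canvas, patch=16, ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping patches (SimMIM-style).

    `patch` and `ratio` are illustrative defaults, not the paper's values.
    Returns the masked canvas and the boolean pixel-level mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = canvas.shape
    assert height % patch == 0 and width % patch == 0, "canvas must tile evenly"

    grid_h, grid_w = height // patch, width // patch
    n_patches = grid_h * grid_w
    n_masked = int(round(ratio * n_patches))

    # Choose which patches to hide, then expand to pixel resolution.
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    patch_mask = flat.reshape(grid_h, grid_w)
    pixel_mask = patch_mask.repeat(patch, axis=0).repeat(patch, axis=1)

    masked = canvas.copy()
    masked[pixel_mask] = 0
    return masked, pixel_mask


# Synthetic stand-ins for grayscale US images and a binary mask.
rng = np.random.default_rng(0)
support_img = rng.random((64, 64)).astype(np.float32)
support_mask = (support_img > 0.5).astype(np.float32)
query_img = rng.random((64, 64)).astype(np.float32)

canvas = build_icl_canvas(support_img, support_mask, query_img)
masked, pixel_mask = random_patch_mask(canvas, patch=16, ratio=0.5, rng=rng)
print(canvas.shape, float(pixel_mask.mean()))
```

In a masked-image-modeling setup of this kind, the encoder would receive `masked` and be trained to reconstruct the hidden pixels. How the actual model treats the query-mask tile during training and inference is not described in this summary.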