Data Adaptive Few-shot Multi Label Segmentation with Foundation Model

Gurunath Reddy, Dattesh Shanbhag, Deepa Anand

TL;DR

This work proposes foundation model (FM) based adapters for single-label and multi-label localization and segmentation. The adapters address the limitations of approaches that rely directly on sub-pixel-level features of existing ViT-based foundation models, and the method is demonstrated on multiple segmentation and localization tasks.

Abstract

The high cost of obtaining accurate annotations for image segmentation and localization makes one-shot and few-shot algorithms attractive. Several state-of-the-art methods for few-shot segmentation have emerged, including text-based prompting, but they perform sub-optimally on medical images. Leveraging sub-pixel-level features of existing Vision Transformer (ViT) based foundation models to identify a similar region of interest (RoI) from a single template image has been shown to be very effective for one-shot segmentation and localization in medical images across modalities. However, such methods rely on the assumption that the template image and the test image are well matched and that simple correlation is sufficient to obtain correspondences. In practice, such an approach can fail to generalize to clinical data because of patient pose changes and inter-protocol variations, even within a single modality, and it can fail to extend to 3D data from a single template image. Moreover, for multi-label tasks, RoI identification has to be performed sequentially. In this work, we propose foundation model (FM) based adapters for single-label and multi-label localization and segmentation to address these concerns. We demonstrate the efficacy of the proposed method on multiple segmentation and localization tasks for both 2D and 3D data, as well as on clinical data with different poses, and evaluate it against state-of-the-art few-shot segmentation methods.
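
To make the correlation-based baseline concrete, the sketch below scores every target pixel feature against a single template feature and thresholds the result. It is only an illustration under stated assumptions: cosine similarity is assumed as the correlation measure, plain Python lists stand in for ViT feature vectors, and the names `cosine_similarity`, `correlate_template`, and `threshold` are hypothetical rather than taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero feature as matching nothing
    return dot / (norm_u * norm_v)


def correlate_template(template_feature, target_features, threshold):
    """Binary map: 1 where a target pixel's feature is similar enough.

    target_features is a 2D grid (list of rows) of feature vectors.
    threshold is a hand-picked cutoff (assumption, not from the paper).
    """
    return [
        [1 if cosine_similarity(template_feature, f) >= threshold else 0
         for f in row]
        for row in target_features
    ]


# Toy 2x2 "image" with 2-dimensional features.
template = [1.0, 0.0]
target = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [2.0, 0.0]],
]
strict_map = correlate_template(template, target, threshold=0.9)
loose_map = correlate_template(template, target, threshold=0.7)
```

In this toy case the pixel with feature `[1.0, 1.0]` is included at a threshold of 0.7 but excluded at 0.9, which illustrates how much the output of a plain correlation approach depends on a manually chosen threshold.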

Paper Structure

This paper contains 8 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: The differences in image intensity distributions across a cohort of knee and shoulder MRI data are shown here. We observe heterogeneity in the distributions across cases, which would necessitate manual threshold adaptation; our proposed approach overcomes this.
  • Figure 2: The pixels within the RoI are considered as label 1 and outside as label 0. A simple classifier is trained to predict the labels of the pixels from the feature vectors, derived from the trained DINOv2 model.
  • Figure 3: We choose pairs of pixels from the RoI (red markers) as positive pixel pairs. For negative pairs, we pair a pixel from the RoI (red) with a pixel outside the RoI (blue) for contrastive learning. In our experiments, the negative pixel was chosen 10 pixels away from the RoI, but further experiments showed that choosing negative pixels anywhere in the non-RoI region does not hamper the results.
  • Figure 4: Contrastive similarity model for binary label localization. (a) The contrastive model consists of two subnetworks. Positive pairs of pixels (red markers) are chosen from the RoI (shoulder tibia), and negative pairs consist of one pixel from the RoI and one from outside the RoI (green marker). Feature vectors for the pixel pairs are derived from the finetuned DINOv2 ViT model. The feature vector pairs are passed to the network, which learns the similarity measure for localization by minimizing the distance between positive pairs and maximizing the distance between negative pairs. The model is trained with a cross-entropy loss to produce a localization map, alleviating the need for thresholding. (b) Extension of the proposed approach to multi-label localization. Multiple labels for knee localization are shown here: TT (red box), patella (green box), and UPT (blue box). For each landmark, positive and negative pixel pairs are sampled. The contrastive model is trained on these paired pixel features using a cross-entropy loss.
  • Figure 5: Inference procedure for localization and segmentation. Feature vectors for pixels from the RoI in the template image are derived using the ViT and treated as template/reference pixel feature vectors. Similarly, feature vectors for all pixels in the target image are computed. Reference and target feature vectors are paired and given as input to the contrastive model to obtain the localization region. The output undergoes connected component analysis to remove stray/isolated pixels. From the localized region, ten pixels are randomly chosen and used as prompts to SAM for refined segmentation.
  • ...and 7 more figures
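
The post-processing steps in the Figure 5 caption (connected component analysis followed by random selection of ten prompt pixels) can be sketched as follows. This is a dependency-free illustration, not the authors' implementation: 4-connectivity, the `min_size` cutoff, the fixed random seed, and all function names are assumptions. The call to SAM itself is omitted.

```python
import random
from collections import deque


def connected_components(mask):
    """Return 4-connected foreground components of a binary mask.

    mask is a non-empty list of rows; each component is a list of (row, col).
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


def clean_mask(mask, min_size):
    """Drop components with fewer than min_size pixels (assumed cutoff)."""
    cleaned = [[0] * len(mask[0]) for _ in mask]
    for component in connected_components(mask):
        if len(component) >= min_size:
            for y, x in component:
                cleaned[y][x] = 1
    return cleaned


def sample_prompt_points(mask, num_points=10, seed=0):
    """Randomly pick up to num_points foreground pixels as (row, col)."""
    foreground = [(y, x) for y, row in enumerate(mask)
                  for x, value in enumerate(row) if value]
    rng = random.Random(seed)  # fixed seed only for reproducibility
    return rng.sample(foreground, min(num_points, len(foreground)))


# Toy localization map: one 12-pixel region plus a stray pixel at (1, 4).
localization = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
cleaned = clean_mask(localization, min_size=3)
prompts = sample_prompt_points(cleaned, num_points=10)
```

The sampled `(row, col)` points would then be passed to SAM as foreground point prompts. Note that many prompt interfaces expect `(x, y)` order, so the coordinates may need to be swapped.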