Review of Zero-Shot and Few-Shot AI Algorithms in The Medical Domain

Maged Badawi, Mohammedyahia Abushanab, Sheethal Bhat, Andreas Maier

TL;DR

This survey addresses the data-scarce setting of medical imaging by reviewing zero-shot and few-shot object detection alongside regular detectors. It surveys methods that leverage vision-language models and semantic embeddings (e.g., CLIP-based alignment) to recognize and localize unseen or rare medical objects, including ZSD-YOLO, GTNet, and GRAN, and discusses related prompts and self-training strategies. Key findings indicate that semantic alignment, feature synthesis, and contextual reasoning can substantially improve detection of unseen classes and generalization, with metrics such as mAP and AUROC showing gains across medical and natural datasets. Despite progress, the review notes limited discussion of development-time challenges and advocates for deeper analyses of domain-specific limitations, broader adoption of VLPMs, and more robust, domain-adapted evaluations to guide future work.

Abstract

This paper investigates different techniques for few-shot, zero-shot, and regular object detection. The need for few-shot and zero-shot learning arises from two limitations of traditional machine learning, deep learning, and computer vision methods: they require large amounts of data, and they generalize poorly. Few-shot and zero-shot techniques can give strong results from only a few training examples, reducing the amount of data required and improving generalization. This survey highlights papers from the last three years that use few-shot and zero-shot learning to address these challenges. We review zero-shot, few-shot, and regular object detection methods and categorize them in an understandable manner. Based on the comparisons made within each category, the approaches were found to be quite impressive. This integrated review of diverse papers on few-shot, zero-shot, and regular object detection reveals a shared focus on advancing the field through novel frameworks and techniques. A noteworthy observation is the scarcity of detailed discussion of the difficulties encountered during the development phase. Contributions include the introduction of innovative models, such as ZSD-YOLO and GTNet, which often show improvements on metrics such as mean average precision (mAP), Recall@100 (RE@100), the area under the receiver operating characteristic curve (AUROC), and precision. These findings underscore a collective move towards leveraging vision-language models for versatile applications, with potential areas for future research including a more thorough exploration of limitations and domain-specific adaptations.

Paper Structure

This paper contains 8 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An overview of the proposed training method of ZSD-YOLO [xie2022zero]. The method aligns detector semantic outputs to vision and language embeddings from a pre-trained vision-language model such as CLIP. YOLOv5 is modified to replace the typical class outputs with a semantic output whose shape equals the CLIP embedding size. The predicted semantic outputs of positively matched anchors are aligned with the corresponding ground-truth text embeddings using a modified cross-entropy loss, and with the image embeddings of those anchors using a modified L1 loss. Best viewed in color.
  • Figure 2: Illustration of the IoU-Aware Generative Adversarial Network (IoUGAN) [zhao2020gtnet]. The Class Feature Generating Unit (CFU) takes the class embeddings and random noise vectors as input and outputs features with intra-class variance. The Foreground Feature Generating Unit (FFU) and the Background Feature Generating Unit (BFU) then add IoU variance to the results of the CFU and output the class-specific foreground and background features, respectively.
  • Figure 3: Overall structure of the proposed model (TRMFCN) [chen2020triple], which consists of an encoding process and a decoding process. The encoding process includes traditional 2D convolution, max pooling, and residual multiscale (RM) blocks. The decoding process includes deconvolution, RM blocks, concatenate blocks, and traditional 2D convolution. The RM block is inspired by ResNet and Inception V1. The concatenate block is inherited from U-Net.
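
The two alignment terms described in the caption of Figure 1 can be illustrated with a minimal sketch. This is not the ZSD-YOLO implementation: the caption speaks of a "modified" cross-entropy loss and a "modified" L1 loss without giving the modifications, so the code below assumes the plain versions of both. The function names, the temperature value of 0.07, and the toy three-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def text_alignment_loss(semantic_output, text_embeddings, target_index,
                        temperature=0.07):
    """Plain softmax cross-entropy over cosine similarities to class text embeddings.

    Assumption: stands in for the 'modified cross-entropy' named in the caption.
    """
    pred = l2_normalize(semantic_output)
    logits = [
        sum(a * b for a, b in zip(pred, l2_normalize(text))) / temperature
        for text in text_embeddings
    ]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


def image_alignment_loss(semantic_output, image_embedding):
    """Mean absolute difference between normalized vectors.

    Assumption: stands in for the 'modified L1' named in the caption.
    """
    pred = l2_normalize(semantic_output)
    target = l2_normalize(image_embedding)
    return sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)


# Toy example: two class text embeddings and one positively matched anchor.
class_text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
anchor_output = [0.9, 0.1, 0.0]

loss_correct = text_alignment_loss(anchor_output, class_text_embeddings, 0)
loss_wrong = text_alignment_loss(anchor_output, class_text_embeddings, 1)
```

In this toy case the anchor output points mostly along the first class embedding, so the text alignment loss is lower when the ground-truth class is index 0 than when it is index 1. Because classes are represented by text embeddings rather than fixed output neurons, an unseen class could in principle be scored at inference time by appending its text embedding to the list, which is the property that makes this kind of alignment suitable for zero-shot detection.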