Few Shot Semantic Segmentation: a review of methodologies, benchmarks, and open challenges

Nico Catalano, Matteo Matteucci

TL;DR

Few-Shot Semantic Segmentation addresses the challenge of segmenting novel classes from very limited labeled data by integrating principles from few-shot learning with pixel-level segmentation. The paper surveys three main methodological strands—conditional networks, prototypical networks, and latent-space optimization—and discusses how foundation-model strategies, including prompt engineering, multimodal cues, and generalist models, are redefining the field. It aggregates standard benchmarks, datasets, and metrics while highlighting open challenges such as domain shift, cross-domain generalization, and continual learning, and it analyzes latent representations from GANs, contrastive learning, and VAEs. The review also emphasizes the practical promise of vision foundation models like SAM and CLIP for FSS, particularly in data-scarce domains relevant to medicine and agriculture, and it outlines directions for future research toward robust, scalable, and adaptable FSS systems.

Abstract

Semantic segmentation, vital for applications ranging from autonomous driving to robotics, faces significant challenges in domains where collecting large annotated datasets is difficult or prohibitively expensive. In such contexts, such as medicine and agriculture, the scarcity of training images hampers progress. Few-Shot Semantic Segmentation, a novel task in computer vision, aims at designing models capable of segmenting new semantic classes with only a few examples. This paper presents a comprehensive survey of Few-Shot Semantic Segmentation, tracing its evolution and exploring various model designs, from the more popular conditional and prototypical networks to the more niche latent space optimization methods, and also presenting the new opportunities offered by recent foundational models. Through a chronological narrative, we dissect influential trends and methodologies, providing insights into their strengths and limitations. A timeline offers a visual roadmap, marking key milestones in the field's progression. Complemented by quantitative analyses on benchmark datasets and qualitative showcases of seminal works, this survey equips readers with a deep understanding of the topic. By elucidating current challenges, state-of-the-art models, and prospects, we aid researchers and practitioners in navigating the intricacies of Few-Shot Semantic Segmentation and provide ground for future development.

Paper Structure (22 sections, 9 equations, 7 figures, 7 tables)

This paper contains 22 sections, 9 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of various computer vision (CV) tasks. A given image (picture a) admits multiple levels of interpretation. Object Detection (picture b) identifies the region containing a subject using bounding boxes. Semantic Segmentation (picture c) assigns a class label to each pixel without necessarily distinguishing different instances of the same subject class. Parts Segmentation (picture d) provides pixel-level annotation by segmenting the constituent parts of the intended subject, such as the face parts in this example. Instance Segmentation (picture e) distinguishes the subjects in the scene with a different segmentation mask for each one but does not necessarily assign a semantic class. Panoptic Segmentation (picture f) combines the semantic classification of every pixel in an image with a separate mask for each unique subject instance. In the given image, two mopeds are identified with the labels "Moped_1" and "Moped_2", illustrating distinct instances within the same class.
  • Figure 2: Timeline illustrating the progression of the FSS research field. Some of the most consequential models are depicted at the top, with colored bands representing design trends and influences. Overlapping and fading bands suggest the sharing of concepts between multiple works to varying degrees. Below, key milestones mark the definition of the FSS problem and its variations.
  • Figure 3: Conditional networks, adapted from rakelly2018conditional. Using the support set $S(l)$, the conditioning branch produces the parameter set $\theta$. The segmentation branch then uses $\theta$ to predict a mask $\hat{M}_q$ over the query image $I_q$.
  • Figure 4: Prototypical networks, adapted from snell2017prototypical. Each prototype $c_k$ is computed as the mean of the support-set embeddings belonging to the same class, shown in the picture with distinct colors. The query embedding $X$ is then assigned the label of the prototype it is closest to.
  • Figure 5: Prototypical networks for segmentation: a shared feature extractor computes a feature volume for both the support set and the query images. The map module takes the feature volume from the support set and masks it with the ground-truth mask via the Hadamard product $\odot$ to compute the class prototype. The prediction mask $\hat{M}_q$ is calculated by applying a metric between the vector at each spatial location in the query feature volume and the class prototype.
  • ...and 2 more figures
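
The prototype computation described in the caption of Figure 5 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any surveyed model: feature volumes are plain nested lists of shape H x W x C, the metric is cosine similarity, and the decision threshold of 0.5 is arbitrary. The function names `masked_average_prototype`, `cosine_similarity`, and `predict_mask` are hypothetical.

```python
import math


def masked_average_prototype(features, mask):
    """Average the support feature vectors selected by a binary mask.

    features: H x W x C nested lists (support feature volume).
    mask:     H x W nested lists of 0/1 (support ground-truth mask).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            count += m
            for c in range(channels):
                # Element-wise (Hadamard) masking: background vectors add 0.
                total[c] += vec[c] * m
    if count == 0:
        raise ValueError("support mask selects no pixels")
    return [t / count for t in total]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def predict_mask(query_features, prototype, threshold=0.5):
    """Label each query location 1 if it is similar enough to the prototype.

    The threshold is an assumption made for this sketch.
    """
    return [
        [1 if cosine_similarity(vec, prototype) > threshold else 0 for vec in row]
        for row in query_features
    ]


# Toy 2 x 2 support volume with 2 channels; the left column is foreground.
support_features = [[[1.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [0.0, 1.0]]]
support_mask = [[1, 0],
                [1, 0]]
prototype = masked_average_prototype(support_features, support_mask)

# Toy 1 x 2 query volume.
query_features = [[[2.0, 0.0], [0.0, 3.0]]]
predicted = predict_mask(query_features, prototype)
```

In practice the feature volumes would come from a learned backbone and the similarity map would typically feed a trainable decoder or a softmax over several prototypes; the hard threshold here only keeps the example short.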

Theorems & Definitions (4)

  • Definition 1: Machine Learning (mitchell1997machine, mohri2018foundations)
  • Definition 2: FSL (wang2020generalizing)
  • Definition 3: FSS (BMVC2017_167)
  • Definition 4: N-Way K-Shot Semantic Segmentation
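
As a rough illustration of the N-Way K-Shot setting named in Definition 4, the sketch below samples one episode: N classes, with K labeled support examples per class and a few held-out query examples. The paper's formal definition is not reproduced on this page, so the details are assumptions: the dataset layout (a dict mapping a class name to a list of examples), the `n_query` parameter, and the function name `sample_episode` are all hypothetical.

```python
import random


def sample_episode(examples_by_class, n_way, k_shot, n_query=1, rng=None):
    """Sample one N-way K-shot episode.

    examples_by_class: dict mapping class name -> list of examples
                       (for segmentation, e.g. (image, mask) pairs).
    Returns (support, query): dicts mapping each sampled class to
    k_shot support examples and n_query disjoint query examples.
    """
    rng = rng or random.Random()
    # Pick N distinct classes; sorting makes the draw reproducible for a seed.
    classes = rng.sample(sorted(examples_by_class), n_way)
    support, query = {}, {}
    for cls in classes:
        # Draw without replacement so support and query never overlap.
        picked = rng.sample(examples_by_class[cls], k_shot + n_query)
        support[cls] = picked[:k_shot]
        query[cls] = picked[k_shot:]
    return support, query


# Toy dataset: strings stand in for (image, mask) pairs.
toy_dataset = {
    "cat": ["cat_0", "cat_1", "cat_2", "cat_3"],
    "dog": ["dog_0", "dog_1", "dog_2", "dog_3"],
    "bus": ["bus_0", "bus_1", "bus_2", "bus_3"],
}
support, query = sample_episode(toy_dataset, n_way=2, k_shot=1,
                                rng=random.Random(0))
```

A 1-way 1-shot episode, the simplest case, reduces to a single support example and a query image of the same class.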