More than the Sum of Its Parts: Ensembling Backbone Networks for Few-Shot Segmentation

Nico Catalano, Alessandro Maranelli, Agnese Chiatti, Matteo Matteucci

TL;DR

The paper addresses data-scarce few-shot segmentation by examining whether ensembling features from multiple backbones can outperform single-backbone approaches. It introduces two ensembling strategies, Independent Voting and Feature Volume Fusion, within PANet's non-parametric mask prediction, which enables a controlled study of backbone combinations (VGG16, ResNet50, MobileNet-V3-Large) in a one-shot setting. On the PASCAL-5$^i$ and COCO-20$^i$ benchmarks, three-backbone ensembles improve mean IoU by +7.37% and +10.68% respectively, suggesting that diverse embeddings provide complementary information for accurate segmentation. The findings indicate that multi-backbone representations can ease the application of few-shot segmentation in data-scarce environments, and they point to future work on transformers and efficient, attention-free alternatives to further improve robustness and efficiency.

Abstract

Semantic segmentation is a key prerequisite to robust image understanding for applications in Artificial Intelligence and Robotics. Few-Shot Segmentation (FSS), in particular, concerns the extension and optimization of traditional segmentation methods in challenging conditions where limited training examples are available. A predominant approach in FSS is to rely on a single backbone for visual feature extraction, and the choice of backbone is a deciding factor in overall performance. In this work, we investigate whether fusing features from different backbones can improve the ability of FSS models to capture richer visual features. To tackle this question, we propose and compare two ensembling techniques: Independent Voting and Feature Fusion. Among the available FSS methods, we implement the proposed ensembling techniques on PANet. The module dedicated to predicting segmentation masks from the backbone embeddings in PANet avoids trainable parameters, creating a controlled "in vitro" setting for isolating the impact of different ensembling strategies. Leveraging the complementary strengths of different backbones, our approach outperforms the original single-backbone PANet across standard benchmarks, even in challenging one-shot learning scenarios. Specifically, it achieved a performance improvement of +7.37% on PASCAL-5$^i$ and of +10.68% on COCO-20$^i$ in the top-performing scenario, where three backbones are combined. These results, together with the qualitative inspection of the predicted subject masks, suggest that relying on multiple backbones in PANet leads to a more comprehensive feature representation, thus expediting the successful application of FSS methods in challenging, data-scarce environments.

Paper Structure (13 sections, 11 equations, 4 figures, 2 tables)

This paper contains 13 sections, 11 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Adapted from the PANet paper, this diagram illustrates the inference process of PANet. First, features are extracted from both the Query Image and the examples in the Support Set through a shared backbone. Then, Masked Average Pooling is applied to features extracted from the support images, generating prototypes for each labeled subject class. Finally, the cosine distance is computed between the embeddings at each spatial location within the query feature volume and each prototype, yielding the predicted mask $\hat{M}_q$.
  • Figure 2: Independent Voting: the Query Image and Support Set examples are passed in parallel through multiple PANet branches, each employing a distinct backbone. The individual probability maps generated by each branch are then combined using Bayesian voting to produce the final prediction, $\hat{M}_q$.
  • Figure 3: Feature Volume Fusion: This diagram illustrates the Feature Volume Fusion process, where two or more backbones are applied for extracting features from the Query Image and examples in the Support Set. These features are then concatenated along the channel axis, forming a consolidated ensembled feature map. The ensembled feature map is subsequently given as an input to the non-parametric Metric Learning stage of PANet.
  • Figure 4: Qualitative Results: column (a) shows Query images with ground truth labels. The predictions of the baseline models are displayed in column (b) for MobileNet-V3-Large, column (c) for VGG16, and column (d) for ResNet50. These include notable false-positive and false-negative predictions, revealing the challenges in accurately capturing certain object parts. In contrast, predictions from ensemble techniques configured with all three backbones demonstrate significant improvements. As shown in columns (e) Independent Voting and (f) Feature Volume Fusion, subject coverage is enhanced, compensating for the limitations observed for the individual baselines. Under each prediction we also report the IoU score achieved.
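
The pipeline described in the captions of Figures 1 to 3 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the feature maps from each backbone are assumed to be already resized to a common spatial resolution, the scaling factor `alpha` applied to the cosine similarity is an assumed value, and "Bayesian voting" is interpreted here as a renormalized product of the per-branch class probabilities, which the caption itself does not specify.

```python
import numpy as np

def masked_average_pooling(features, mask):
    """features: (H, W, D) support features; mask: (H, W) binary -> (D,) prototype."""
    weights = mask[..., None].astype(features.dtype)
    return (features * weights).sum(axis=(0, 1)) / (weights.sum() + 1e-8)

def cosine_scores(query_features, prototypes):
    """query_features: (H, W, D); prototypes: (C, D) -> (H, W, C) cosine similarities."""
    q = query_features / (np.linalg.norm(query_features, axis=-1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=-1, keepdims=True) + 1e-8)
    return q @ p.T

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def branch_probabilities(support_feats, support_mask, query_feats, alpha=20.0):
    """One PANet-style branch: prototypes from the support image, then
    non-parametric matching on the query. alpha is an assumed scaling factor."""
    fg = masked_average_pooling(support_feats, support_mask)
    bg = masked_average_pooling(support_feats, 1 - support_mask)
    prototypes = np.stack([bg, fg])  # class 0 = background, class 1 = subject
    return softmax(alpha * cosine_scores(query_feats, prototypes))

def feature_volume_fusion(support_list, support_mask, query_list):
    """Concatenate per-backbone features along the channel axis, then match once."""
    fused_support = np.concatenate(support_list, axis=-1)
    fused_query = np.concatenate(query_list, axis=-1)
    return branch_probabilities(fused_support, support_mask, fused_query)

def independent_voting(support_list, support_mask, query_list):
    """Run one branch per backbone and combine the probability maps.
    Assumption: voting = product of per-branch probabilities, renormalized."""
    probs = [branch_probabilities(s, support_mask, q)
             for s, q in zip(support_list, query_list)]
    joint = np.prod(probs, axis=0)
    return joint / joint.sum(axis=-1, keepdims=True)

def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()
    return inter / union if union else 1.0
```

Both ensembling functions take one feature map per backbone, in the same order for the support and query lists, and return an (H, W, 2) probability map whose argmax over the last axis gives the predicted mask. The difference between them is where the combination happens: Feature Volume Fusion merges the embeddings before prototype matching, whereas Independent Voting merges the predictions after it.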