Matching-Based Few-Shot Semantic Segmentation Models Are Interpretable by Design

Pasquale De Marinis, Uzay Kaymak, Rogier Brussee, Gennaro Vessio, Giovanna Castellano

TL;DR

This work tackles interpretability in few-shot semantic segmentation (FSS) by introducing Affinity Explainer (AffEx), which exploits the intrinsic pixel-level matching structure of matching-based FSS models to produce attribution maps over support images. AffEx offers three variants (Unmasked, Masked, and Signed) that derive per-layer contributions from the matching scores across multiple feature levels; these contributions are softmax-normalized and aggregated using weights obtained by layer-wise feature ablation. The authors extend evaluation with the causal metrics IAUC and DAUC and create mIoU-based variants to quantify explanation usefulness. In both quantitative and qualitative analyses, AffEx outperforms standard attribution methods on COCO-$20^{i}$ and Pascal-$5^{i}$ across two representative models (DCAMA and DMTNet) while maintaining reasonable computational efficiency. The paper also provides a comprehensive ablation study, a computational-cost analysis, and supplementary LIME adaptations, establishing a foundation for interpretable FSS and outlining directions to broaden applicability to more FSS architectures and hybrid explainability strategies.
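The aggregation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `affinity_attribution`, the use of cosine similarity as the matching score, the assumption that all feature levels share the same support resolution, and the externally supplied `layer_weights` (standing in for ablation-derived weights) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def affinity_attribution(query_feats, support_feats, region, layer_weights):
    """Hypothetical sketch of matching-score attribution over support pixels.

    query_feats / support_feats: lists with one (N, d_l) array per feature level.
    region: boolean (N_query,) mask selecting the query pixels to explain.
    layer_weights: one weight per level (assumed to come from feature ablation).
    """
    total = np.zeros(support_feats[0].shape[0])
    for q, s, w in zip(query_feats, support_feats, layer_weights):
        # Cosine similarity between every query pixel and every support pixel.
        qn = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
        sn = s / (np.linalg.norm(s, axis=1, keepdims=True) + 1e-8)
        affinity = qn @ sn.T                      # shape (N_query, N_support)
        # Accumulate the matching scores of the query pixels inside the region.
        per_support = affinity[region].sum(axis=0)
        # Softmax-normalize per level, then add the weighted contribution.
        total += w * softmax(per_support)
    return total

# Toy example: one level, two query pixels, three support pixels.
query = [np.array([[1.0, 0.0], [0.0, 1.0]])]
support = [np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])]
region = np.array([True, False])                  # explain only query pixel 0
attribution = affinity_attribution(query, support, region, np.array([1.0]))
print(attribution)
```

In this toy case the support pixel whose feature matches the selected query pixel receives the largest attribution. A real model would additionally need to resize per-level maps to a common resolution before summing.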

Abstract

Few-Shot Semantic Segmentation (FSS) models achieve strong performance in segmenting novel classes with minimal labeled examples, yet their decision-making processes remain largely opaque. While explainable AI has advanced significantly in standard computer vision tasks, interpretability in FSS remains virtually unexplored despite its critical importance for understanding model behavior and guiding support set selection in data-scarce scenarios. This paper introduces the first dedicated method for interpreting matching-based FSS models by leveraging their inherent structural properties. Our Affinity Explainer approach extracts attribution maps that highlight which pixels in support images contribute most to query segmentation predictions, using matching scores computed between support and query features at multiple feature levels. We extend standard interpretability evaluation metrics to the FSS domain and propose additional metrics to better capture the practical utility of explanations in few-shot scenarios. Comprehensive experiments on FSS benchmark datasets, using different models, demonstrate that our Affinity Explainer significantly outperforms adapted standard attribution methods. Qualitative analysis reveals that our explanations provide structured, coherent attention patterns that align with model architectures and enable effective model diagnosis. This work establishes the foundation for interpretable FSS research, enabling better model understanding and diagnostics for more reliable few-shot segmentation systems. The source code is publicly available at https://github.com/pasqualedem/AffinityExplainer.
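To make the evaluation idea concrete, the following sketch shows a generic deletion-style causal metric (the DAUC named in the TL;DR): support pixels are removed in order of decreasing attribution and the area under the resulting score curve is measured, so a lower value suggests a more faithful explanation. The names `deletion_curve`, `area_under_curve`, and `toy_score` are hypothetical, and the toy score function merely stands in for a real model's query mIoU; the paper's exact protocol may differ.

```python
import numpy as np

def deletion_curve(attribution, score_fn, steps=10):
    """Remove support pixels from most to least attributed and record the score.

    score_fn(keep_mask) is a hypothetical callable that returns the model's
    score (e.g. query mIoU) when only the pixels in keep_mask remain visible.
    """
    order = np.argsort(-attribution)              # most important pixels first
    n = attribution.size
    keep = np.ones(n, dtype=bool)
    fractions, scores = [0.0], [score_fn(keep)]
    for k in range(1, steps + 1):
        cutoff = round(k * n / steps)
        keep[order[:cutoff]] = False              # delete the top-ranked pixels
        fractions.append(cutoff / n)
        scores.append(score_fn(keep))
    return np.array(fractions), np.array(scores)

def area_under_curve(x, y):
    """Trapezoidal-rule area under the curve y(x)."""
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

# Toy stand-in for a model: the score is the share of "true importance" kept.
importance = np.array([4.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

def toy_score(keep):
    return importance[keep].sum() / importance.sum()

dauc_good = area_under_curve(*deletion_curve(importance, toy_score))
dauc_bad = area_under_curve(*deletion_curve(-importance, toy_score))
print(dauc_good, dauc_bad)
```

The insertion counterpart (IAUC) follows the same pattern in reverse: start from an empty support image, add pixels in order of decreasing attribution, and prefer a higher area.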

Paper Structure

This paper contains 19 sections, 14 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of the proposed framework with two examples and one class. The Affinity Explainer method can be applied to any matching-based FSS model adhering to the illustrated structure. It computes attribution maps using the matching scores between query and support images. Attribution is restricted to a region $\mathcal{R}$—typically defined by the ground truth mask, segmentation output, or user-specified input. These maps are then softmax-normalized, weighted via feature ablation, and aggregated to yield the final attribution map. The resulting maps highlight the support image regions most influential in segmenting the query, thereby enabling model interpretability. Evaluation is performed using the proposed metrics.
  • Figure 2: A 1-way 1-shot example of Saliency, Blur IG, XRAI, Unmasked AffEx, Masked AffEx, and Signed AffEx for the two tested models DMTNet (first two rows) and DCAMA (last two rows) over four different episodes. The prediction is highlighted in red over the query image, and the ground truth is shown in blue only when it differs significantly from the prediction.
  • Figure S.1: Ablation study on the impact of similarity map resolution on interpretability quality (mIoULoss@0.01) and computational cost (inference time in ms) for AffEx applied to DMTNet (Chen et al., 2024) and DCAMA (Shi et al., 2022). The results highlight the trade-off between attribution mIoUL@0.01 and resource consumption across different resolutions.
  • Figure S.2: Qualitative results for Signed AffEx with DCAMA across multiple episodes from COCO-$20^{i}$ (first two rows) and Pascal-$5^{i}$ (last two rows). For each episode, the left column shows the query image with the predicted segmentation in red; the ground truth is overlaid in blue when it significantly differs from the prediction. The right column presents the corresponding support set alongside the attribution maps produced by the proposed method.
  • Figure S.3: Qualitative results for Signed AffEx with DCAMA across multiple episodes from COCO-$20^{i}$ (first two rows) and Pascal-$5^{i}$ (last two rows). For each episode, the left column shows the query image with the predicted segmentation in red; the ground truth is overlaid in blue when it significantly differs from the prediction. The right column presents the corresponding support set alongside the attribution maps produced by the proposed method.
  • ...and 2 more figures