Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation

Hongli Liu, Yu Wang, Shengjie Zhao

TL;DR

VINE is a unified framework that jointly models structural consistency and foreground discrimination to refine class-specific prototypes and alleviate foreground ambiguity. Experiments validate its effectiveness and robustness, particularly under challenging scenarios with viewpoint shifts and complex structures.

Abstract

Few-shot segmentation (FSS) has gained significant attention for its ability to generalize to novel classes with limited supervision, yet remains challenged by structural misalignment and cross-view inconsistency under large appearance or viewpoint variations. This paper tackles these challenges by introducing VINE (View-Informed NEtwork), a unified framework that jointly models structural consistency and foreground discrimination to refine class-specific prototypes. Specifically, VINE introduces a spatial-view graph on backbone features, where the spatial graph captures local geometric topology and the view graph connects features from different perspectives to propagate view-invariant structural semantics. To further alleviate foreground ambiguity, we derive a discriminative prior from the support-query feature discrepancy to capture category-specific contrast, which reweights SAM features by emphasizing salient regions and recalibrates backbone activations for improved structural focus. The foreground-enhanced SAM features and structurally enriched ResNet features are progressively integrated through masked cross-attention, yielding class-consistent prototypes used as adaptive prompts for the SAM decoder to generate accurate masks. Extensive experiments on multiple FSS benchmarks validate the effectiveness and robustness of VINE, particularly under challenging scenarios with viewpoint shifts and complex structures. The code is available at https://github.com/HongliLiu1/VINE-main.
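
The abstract mentions masked cross-attention but does not give its formulation. As a rough illustration only, the sketch below shows a generic masked cross-attention step in NumPy: prototype tokens attend to flattened feature vectors, and positions outside a foreground mask are excluded before the softmax. The function name, argument names, and shapes are assumptions made for this example and are not taken from the VINE code.

```python
import numpy as np

def masked_cross_attention(queries, features, mask):
    """Generic masked cross-attention (illustrative, not the VINE implementation).

    queries:  (P, D) prototype tokens (hypothetical name)
    features: (N, D) flattened feature map
    mask:     (N,) boolean array, True for foreground positions
    Returns the updated prototypes (P, D) and the attention weights (P, N).
    """
    if not mask.any():
        raise ValueError("mask must contain at least one foreground position")
    d = queries.shape[-1]
    # Scaled dot-product similarity between each prototype and each position.
    scores = queries @ features.T / np.sqrt(d)
    # Exclude background positions so they receive zero attention weight.
    scores = np.where(mask[None, :], scores, -np.inf)
    # Numerically stable softmax over positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each prototype becomes a weighted average of foreground features.
    return weights @ features, weights

# Toy usage with random data (shapes chosen arbitrarily for the example).
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
queries = rng.normal(size=(2, 4))
mask = np.array([True, False, True, True, False, False])
prototypes, weights = masked_cross_attention(queries, features, mask)
```

Because masked positions get a score of negative infinity, their softmax weight is exactly zero, so each output prototype depends only on foreground features. How VINE builds the mask and combines SAM and ResNet features is not described in the abstract and is not modeled here.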

Paper Structure

This paper contains 28 sections, 17 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison between conventional prototype modeling and VINE. (a) View-induced challenges: large intra-class variation and inter-class similarity lead to prototype confusion. (b) Top: conventional mask-guided enhancement fails under viewpoint shifts, generating unstable prototypes. Bottom: VINE unifies spatial–view graph alignment and discriminative modulation to produce view-consistent, structurally reliable prototypes.
  • Figure 2: An overview of our proposed framework. We incorporate Spatial-View Graph Alignment (SVGA) to capture structural consistency across views, and Discriminative Foreground Modulation (DFM) to highlight foreground-relevant features, collaboratively improving the quality of support-query prototypes and guiding accurate mask prediction.
  • Figure 3: Parameter efficiency comparison.
  • Figure 4: Pseudo-mask quality under different methods.
  • Figure 5: Qualitative and feature-level comparison of segmentation results.