Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation

Yue Han, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Yong Liu, Lu Qi, Xiangtai Li, Ming-Hsuan Yang

TL;DR

This work tackles few-shot instance segmentation by addressing overfitting and inadequate exploitation of support cues in RPN-based methods. It introduces Reference Twice (RefT), a transformer-based baseline that uses two cross-attention–driven references to fuse support information at both feature and query levels, coupled with a class-enhanced base knowledge distillation loss for incremental settings. RefT demonstrates strong performance across FSIS, generalized FSIS, and incremental FSIS on COCO and LVIS, outperforming prior methods and maintaining base-class knowledge while learning novel classes. The approach is efficient with a single forward pass over all classes and scales with larger backbones, offering a practical, unified framework for multiple few-shot segmentation settings. These results establish RefT as a solid baseline for future FSIS research and deployment.

Abstract

Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) Overfitting suppresses novel class objects; 2) Dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks. Our three main contributions are: 1) A novel transformer-based baseline that avoids overfitting, offering a new direction for FSIS; 2) Demonstrating that support object queries encode key factors after base training, allowing query features to be enhanced twice, at both the feature and query levels, using simple cross-attention, thus avoiding complex spatial correlation interaction; 3) Introducing a class-enhanced base knowledge distillation loss to address the difficulty DETR-like models have with incremental settings due to the input projection layer, enabling easy extension to incremental FSIS. Extensive experimental evaluations on the COCO dataset under three FSIS settings demonstrate that our method performs favorably against existing approaches across different shots, e.g., a $+8.2/+9.4$ performance gain over state-of-the-art methods in the 10/30-shot settings. Source code and models will be available at https://github.com/hanyue1648/RefT.
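To make the "enhanced twice using simple cross-attention" idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes single-head attention without learned projections and a plain residual connection; the function `cross_attention`, the array names, and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, support_tokens):
    """Single-head cross-attention with a residual connection.

    query_tokens:   (N, D) tokens from the query branch
    support_tokens: (M, D) tokens from the support branch
    Assumption: no learned Q/K/V projections, to keep the sketch short.
    """
    d = query_tokens.shape[-1]
    scores = query_tokens @ support_tokens.T / np.sqrt(d)  # (N, M)
    weights = softmax(scores, axis=-1)                      # each row sums to 1
    return query_tokens + weights @ support_tokens, weights

rng = np.random.default_rng(0)
query_feats = rng.normal(size=(6, 8))       # flattened query-image features (hypothetical size)
class_prototypes = rng.normal(size=(3, 8))  # one prototype per class from the support set
query_queries = rng.normal(size=(4, 8))     # object queries of the query branch
support_queries = rng.normal(size=(3, 8))   # object queries of the support branch

# First reference: feature level (query features attend to class prototypes).
enhanced_feats, w1 = cross_attention(query_feats, class_prototypes)
# Second reference: query level (query-branch object queries attend to support queries).
enhanced_queries, w2 = cross_attention(query_queries, support_queries)
```

Because all class prototypes are stacked into one support matrix, a single attention call covers every class at once, which is in the spirit of the single-forward-pass property described above.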

Paper Structure (21 sections, 11 equations, 12 figures, 23 tables)

Figures (12)

  • Figure 1: (a) Existing RPN-based dual-branch framework and the proposed mask-transformer-based framework. Our method better utilizes the support set on feature and query levels, with only one forward pass handling all classes. (b) Performance on COCO 10-shot. Our unified baseline performs favorably in all settings. Here, P in the circle denotes RoI align or mask pooling, and A in the circle denotes aggregation operation.
  • Figure 2: Baseline results. (a) Confusion matrix and (b) visualization results of several semantically similar classes on COCO minival set (K=10) from the fine-tuned Mask2Former.
  • Figure 3: Support Query Categorization. We visualize the cosine similarity of object queries of the support branch belonging to COCO 20 novel classes. Most object queries are roughly distinguishable, even without fine-tuning. We zoom in on areas that contain highly correlated and easily misclassified classes.
  • Figure 4: Architecture of the proposed Reference Twice (RefT) for FSIS. The query branch refers to the support branch twice, at the feature and query levels. The first reference module for feature-level enhancement performs simultaneous aggregation between the query features and all adaptive class prototypes obtained through mask pooling. The second reference module for query-level feature aggregation links object queries from the query and support branches.
  • Figure 5: First Reference Module for feature-level enhancement performs simultaneous aggregation between the query features and all adaptive class prototypes obtained through mask pooling.
  • ...and 7 more figures
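
Two operations named in the captions above, mask pooling to obtain class prototypes and the cosine similarity between object queries, can be illustrated with a short NumPy sketch. This is an illustration under assumptions rather than the paper's code: the helpers `mask_pool` and `cosine_similarity_matrix` are hypothetical names, mask pooling is taken to be a masked average over spatial positions, and the toy tensors are made up.

```python
import numpy as np

def mask_pool(features, mask):
    """Masked average pooling (assumed form of mask pooling).

    features: (C, H, W) feature map of a support image
    mask:     (H, W) binary instance mask
    returns:  (C,) class prototype
    """
    mask = mask.astype(features.dtype)
    area = mask.sum()
    if area == 0:
        # Empty mask: return a zero prototype instead of dividing by zero.
        return np.zeros(features.shape[0], dtype=features.dtype)
    return (features * mask[None]).sum(axis=(1, 2)) / area

def cosine_similarity_matrix(vectors, eps=1e-8):
    """Pairwise cosine similarity between the rows of a (K, D) array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, eps)
    return unit @ unit.T

# Toy feature map with 2 channels on a 2x2 grid.
features = np.arange(8, dtype=np.float64).reshape(2, 2, 2)
mask = np.array([[1, 0],
                 [1, 0]])
prototype = mask_pool(features, mask)

# Toy "object queries" for three classes; rows 0 and 1 point the same way.
queries = np.array([[1.0, 0.0],
                    [2.0, 0.0],
                    [0.0, 3.0]])
similarity = cosine_similarity_matrix(queries)
```

In this toy case the prototype averages only the left column of each channel, and the similarity matrix shows the first two queries as identical in direction and the third as orthogonal to both.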