Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation
Yue Han, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Yong Liu, Lu Qi, Xiangtai Li, Ming-Hsuan Yang
TL;DR
This work tackles few-shot instance segmentation by addressing overfitting and inadequate exploitation of support cues in RPN-based methods. It introduces Reference Twice (RefT), a transformer-based baseline that uses two cross-attention–driven references to fuse support information at both feature and query levels, coupled with a class-enhanced base knowledge distillation loss for incremental settings. RefT demonstrates strong performance across FSIS, generalized FSIS, and incremental FSIS on COCO and LVIS, outperforming prior methods and maintaining base-class knowledge while learning novel classes. The approach is efficient with a single forward pass over all classes and scales with larger backbones, offering a practical, unified framework for multiple few-shot segmentation settings. These results establish RefT as a solid baseline for future FSIS research and deployment.
Abstract
Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes from a limited number of support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) overfitting suppresses novel-class objects; 2) dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks. Our three main contributions are: 1) a novel transformer-based baseline that avoids overfitting, offering a new direction for FSIS; 2) a demonstration that support object queries encode key factors after base training, which allows query features to be enhanced twice, at both the feature and query levels, using simple cross-attention and thus avoids complex spatial correlation interaction; 3) a class-enhanced base knowledge distillation loss that addresses the difficulty DETR-like models have with incremental settings due to the input projection layer, enabling easy extension to incremental FSIS. Extensive experimental evaluations on the COCO dataset under three FSIS settings demonstrate that our method performs favorably against existing approaches across different shots, e.g., a +8.2/+9.4 performance gain over state-of-the-art methods with 10/30 shots. Source code and models will be available at https://github.com/hanyue1648/RefT.
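The "reference twice" idea in the second contribution can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head scaled dot-product cross-attention with a residual connection, omits all learned projections, and uses hypothetical names (`cross_attention`, `reference_twice`, `query_feats`, `object_queries`, `support_queries`). It only shows the same support object queries being consulted once at the feature level and once at the query level.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention plus a residual.

    queries: list of N vectors of dimension d
    keys, values: lists of M vectors of dimension d
    Learned projections are omitted (an assumption made for brevity).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attended = [
            sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)
        ]
        # Residual connection: keep the original vector, add support context.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def reference_twice(query_feats, object_queries, support_queries):
    """Hypothetical two-stage fusion of support information.

    First reference: query-image features attend to support object queries.
    Second reference: query-image object queries attend to the same
    support object queries.
    """
    enhanced_feats = cross_attention(query_feats, support_queries, support_queries)
    enhanced_queries = cross_attention(object_queries, support_queries, support_queries)
    return enhanced_feats, enhanced_queries
```

With two support queries `[[1.0, 0.0], [0.0, 1.0]]`, a feature vector such as `[2.0, 0.0]` is pulled toward the first support query, because its attention weight on that query is larger. A real model would add multi-head attention, learned projections, and normalization, none of which are specified in the abstract.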
