UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation

Chengyuan Zhang, Yilin Zhang, Lei Zhu, Deyin Liu, Lin Wu, Bo Li, Shichao Zhang, Mohammed Bennamoun, Farid Boussaid

TL;DR

This paper introduces a unified Transformer-based framework for incremental few-shot object detection (iFSOD) and instance segmentation (iFSIS). It extends Mask-DINO into a two-stage incremental learning framework that mitigates over-fitting when learning novel classes.

Abstract

This paper introduces a novel framework for unified incremental few-shot object detection (iFSOD) and instance segmentation (iFSIS) using the Transformer architecture. Our goal is a solution for settings where only a few examples of novel object classes are available and the training data for base (old) classes is no longer accessible, while maintaining high performance on both base and novel classes. To achieve this, we extend Mask-DINO into a two-stage incremental learning framework. Stage 1 optimizes the model on the base dataset, and Stage 2 fine-tunes the model on novel classes. In addition, we incorporate a classifier selection strategy that assigns appropriate classifiers to the encoder and decoder according to their distinct functions. Empirical evidence indicates that this approach effectively mitigates over-fitting when learning novel classes. Furthermore, we apply knowledge distillation to prevent catastrophic forgetting of base classes. Comprehensive evaluations on the COCO and LVIS datasets for both the iFSIS and iFSOD tasks demonstrate that our method significantly outperforms state-of-the-art approaches.

Paper Structure

This paper contains 33 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The backbone of UIFormer. It is a unified Transformer-based model that aligns the region-level task with the pixel-level task. Three prediction heads, i.e., a class-agnostic foreground predictor, a mask predictor, and a box predictor, are placed on top of the encoder to guide learning through foreground prediction, mask prediction, and bounding box prediction.
  • Figure 2: A two-stage training strategy. Stage 1 is base model training, which includes two steps: Step (1) trains the full model on both the object detection and instance segmentation tasks; Step (2) fine-tunes the base model with a self-supervised task based on attention-driven pseudo-labels. Stage 2 is novel fine-tuning, which further optimizes the projection layer and classifier to acquire novel semantic knowledge. Knowledge distillation is used to reduce the discrepancy between the outputs of the projection layer in the base model and those in the novel model.
  • Figure 3: Visualization of representative results of our method on COCO 2014 with $k$=10. The top rows show success cases, while the bottom row shows failure cases.
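
The distillation step described in the Figure 2 caption can be illustrated with a minimal sketch. The exact loss used by the paper is not given on this page, so the mean squared error below, the `distill_weight` parameter, and the function names are all assumptions made for illustration; the features are plain Python lists standing in for projection-layer outputs.

```python
from typing import Sequence


def feature_distillation_loss(
    base_feats: Sequence[float], novel_feats: Sequence[float]
) -> float:
    """Mean squared error between two equally sized feature vectors.

    Assumption: MSE is used only as a stand-in; the paper's actual
    distillation loss is not specified in this summary.
    """
    if len(base_feats) != len(novel_feats):
        raise ValueError("feature vectors must have the same length")
    if len(base_feats) == 0:
        raise ValueError("feature vectors must be non-empty")
    squared = [(b - n) ** 2 for b, n in zip(base_feats, novel_feats)]
    return sum(squared) / len(squared)


def stage2_loss(
    task_loss: float,
    base_feats: Sequence[float],
    novel_feats: Sequence[float],
    distill_weight: float = 1.0,  # hypothetical weighting term
) -> float:
    """Combine the novel-class task loss with the distillation penalty."""
    penalty = feature_distillation_loss(base_feats, novel_feats)
    return task_loss + distill_weight * penalty


if __name__ == "__main__":
    # Frozen base-model projection output vs. the fine-tuned one.
    base = [1.0, 2.0]
    novel = [1.0, 4.0]
    print(stage2_loss(0.5, base, novel, distill_weight=0.5))
```

The penalty grows as the fine-tuned projection outputs drift from those of the frozen base model, which is the mechanism the caption credits with limiting forgetting of base classes.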