Self-Balanced R-CNN for Instance Segmentation

Leonardo Rossi, Akbar Karimi, Andrea Prati

TL;DR

Self-Balanced R-CNN (SBR-CNN) tackles IoU Distribution Imbalance (IDI) and Feature Level Imbalance (FLI) in two-stage instance segmentation with three components: an $R^3$-CNN loop for IoU-balanced RoI refinement, Fully Connected Channels (FCC) for a lighter, fully convolutional head, and a Generic RoI Extraction Layer (GRoIE) that fuses multi-scale FPN features. The architecture is complemented by a redesigned Mask IoU branch and an in-depth exploration of GRoIE's pre- and post-processing options, and it shows consistent improvements across state-of-the-art models and backbones. Extensive ablations demonstrate the benefits of multi-loop training, convolutional head design, and multi-scale RoI feature aggregation, with SBR-CNN achieving 45.3% AP for object detection and 41.5% AP for instance segmentation on COCO minival 2017 using only a ResNet-50 backbone. The work provides practical, plug-in components for improving two-stage detectors, offering performance gains with controlled parameter growth and guidance on when to deploy non-local attention in classification versus regression tasks.

Abstract

Current state-of-the-art two-stage models for the instance segmentation task suffer from several types of imbalance. In this paper, we address the Intersection over Union (IoU) distribution imbalance of positive input Regions of Interest (RoIs) during the training of the second stage. Our Self-Balanced R-CNN (SBR-CNN), an evolved version of the Hybrid Task Cascade (HTC) model, introduces new loop mechanisms for bounding box and mask refinement. With an improved Generic RoI Extraction (GRoIE), we also address the feature-level imbalance at the Feature Pyramid Network (FPN) level, caused by a non-uniform integration of low- and high-level features from the backbone layers. In addition, redesigning the architecture heads toward a fully convolutional approach with FCC further reduces the number of parameters and offers more insight into the connection between the task to solve and the layers used. Moreover, our SBR-CNN model shows the same or even better improvements when adopted in conjunction with other state-of-the-art models. In fact, with a lightweight ResNet-50 backbone, evaluated on the COCO minival 2017 dataset, our model reaches 45.3% and 41.5% AP for object detection and instance segmentation, respectively, with 12 epochs and without extra tricks. The code is available at https://github.com/IMPLabUniPr/mmdetection/tree/sbr_cnn
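
To make the IoU distribution imbalance concrete, the following minimal sketch counts how many candidate RoIs still qualify as positives as the IoU threshold rises. It is an illustration only, not the authors' implementation: the function names `box_iou` and `positives_per_threshold` and the toy boxes are hypothetical, and the thresholds 0.5, 0.6, 0.7 are taken from the Figure 3 caption.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positives_per_threshold(rois, gt_box, thresholds=(0.5, 0.6, 0.7)):
    """Count the RoIs whose IoU with gt_box reaches each threshold."""
    ious = [box_iou(r, gt_box) for r in rois]
    return {t: sum(iou >= t for iou in ious) for t in thresholds}


# Hypothetical toy data: one ground-truth box and four proposals.
gt = (0.0, 0.0, 10.0, 10.0)
proposals = [
    (0.0, 0.0, 10.0, 10.0),  # IoU 1.00
    (0.0, 0.0, 10.0, 8.0),   # IoU 0.80
    (0.0, 0.0, 10.0, 6.5),   # IoU 0.65
    (0.0, 0.0, 10.0, 5.5),   # IoU 0.55
]
counts = positives_per_threshold(proposals, gt)
print(counts)
```

In this toy case the positive set shrinks from four RoIs at threshold 0.5 to two at 0.7, which is the kind of skew toward low-IoU positives that the abstract refers to.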

Paper Structure (24 sections, 6 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: Percentage of times in which, during the RPN training, there does not exist an anchor with a certain value of IoU w.r.t. the ground-truth bounding boxes.
  • Figure 2: Network design. (a) HTC: a multi-stage network which trains each head in a cascade fashion. (b) $R^3$-CNN: our architecture which introduces two loop mechanisms to self-train the heads.
  • Figure 3: The IoU distribution of training samples for (a) Mask R-CNN with a 3x schedule (36 epochs) and (b) $R^3$-CNN, which uses a different IoU threshold at each loop [0.5, 0.6, 0.7]. Better seen in color.
  • Figure 4: (a) Original HTC detector head. (b) Our lighter detector using convolutions with $7\times7$ kernels. (c) Evolution of (b) with rectangular convolutions. (d) Evolution of (b) with non-local pre-processing block. (e) Evolution of (c) with non-local pre-processing block.
  • Figure 5: GRoIE framework. (1) RoI Pooler. (2) Pre-processing phase. (3) Aggregation function. (4) Post-processing phase.
  • ...and 2 more figures
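
The four GRoIE stages named in the Figure 5 caption can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the authors' code: the function `groie`, its parameters, and the flat feature lists are hypothetical, stage 1 (the RoI pooler) is assumed to have already produced one feature vector per FPN level, and element-wise sum is assumed as the aggregation function.

```python
def identity(features):
    """Placeholder for a pre- or post-processing module."""
    return features


def elementwise_sum(per_level):
    """Assumed aggregation: add the features of all FPN levels."""
    return [sum(values) for values in zip(*per_level)]


def groie(per_level_feats, pre=identity, aggregate=elementwise_sum, post=identity):
    """Fuse the pooled features of one RoI across all FPN levels.

    per_level_feats: list of equally sized flat feature lists, one per
    FPN level (stands in for the output of stage 1, the RoI pooler).
    """
    processed = [pre(f) for f in per_level_feats]  # stage 2: pre-processing
    fused = aggregate(processed)                   # stage 3: aggregation
    return post(fused)                             # stage 4: post-processing


# Hypothetical pooled features for one RoI on four FPN levels.
levels = [[1.0, 2.0], [0.5, 0.5], [0.0, 1.0], [2.0, 0.0]]
fused = groie(levels)

# Swapping in a different pre-processing step (here: halving every value).
halved = groie(levels, pre=lambda f: [0.5 * v for v in f])
print(fused, halved)
```

Keeping the pre-processing, aggregation, and post-processing steps as interchangeable callables mirrors the caption's framing of GRoIE as a framework whose stages can be varied independently.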