Self-Balanced R-CNN for Instance Segmentation
Leonardo Rossi, Akbar Karimi, Andrea Prati
TL;DR
Self-Balanced R-CNN (SBR-CNN) tackles IoU Distribution Imbalance (IDI) and Feature Level Imbalance (FLI) in two-stage instance segmentation with three components: an $R^3$-CNN loop for IoU-balanced RoI refinement, Fully Connected Channels (FCC) to build a lighter, fully convolutional head, and a Generic RoI Extraction Layer (GRoIE) to fuse multi-scale FPN features. The architecture is complemented by a redesigned Mask IoU branch and an in-depth exploration of GRoIE’s pre- and post-processing options, and it shows consistent improvements across state-of-the-art models and backbones. Extensive ablations demonstrate the benefits of multi-loop training, the convolutional head design, and multi-scale RoI feature aggregation, with SBR-CNN achieving 45.3% AP for object detection and 41.5% AP for instance segmentation on COCO minival 2017 using only a ResNet-50 backbone. The work provides practical, plug-in components for improving two-stage detectors, offering performance gains with controlled parameter growth and guidance on when to deploy non-local attention in classification versus regression tasks.
Abstract
Current state-of-the-art two-stage models for the instance segmentation task suffer from several types of imbalance. In this paper, we address the Intersection over Union (IoU) distribution imbalance of positive input Regions of Interest (RoIs) during the training of the second stage. Our Self-Balanced R-CNN (SBR-CNN), an evolved version of the Hybrid Task Cascade (HTC) model, introduces new loop mechanisms for bounding box and mask refinement. With an improved Generic RoI Extraction (GRoIE), we also address the feature-level imbalance at the Feature Pyramid Network (FPN) level, which is caused by a non-uniform integration of low- and high-level features from the backbone layers. In addition, redesigning the architecture heads toward a fully convolutional approach with Fully Connected Channels (FCC) further reduces the number of parameters and offers more insight into the connection between the task to be solved and the layers used. Moreover, our SBR-CNN model yields comparable or even larger improvements when adopted in conjunction with other state-of-the-art models. In fact, with a lightweight ResNet-50 backbone, evaluated on the COCO minival 2017 dataset, our model reaches 45.3% and 41.5% AP for object detection and instance segmentation, respectively, with 12 epochs of training and without extra tricks. The code is available at https://github.com/IMPLabUniPr/mmdetection/tree/sbr_cnn
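To make the notion of IoU distribution imbalance concrete, the following is a minimal sketch, not the authors' implementation, of how the IoU between a proposal and a ground-truth box can be computed and how positive proposals can be counted per IoU bin. The function names `box_iou` and `iou_histogram`, the `(x1, y1, x2, y2)` box format, and the bin edges are assumptions chosen for illustration only.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_histogram(proposals, gt_box, edges=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Count positive proposals (IoU >= edges[0]) per IoU bin.

    Bin i covers [edges[i], edges[i + 1]); the last bin is open-ended.
    The edges are illustrative assumptions, not values from the paper.
    """
    counts = [0] * len(edges)
    for p in proposals:
        iou = box_iou(p, gt_box)
        if iou < edges[0]:
            continue  # treated as a negative sample in this sketch
        # Index of the last edge that does not exceed the IoU
        idx = max(i for i, e in enumerate(edges) if iou >= e)
        counts[idx] += 1
    return counts
```

Calling `iou_histogram(proposals, gt_box)` on the proposals matched to one ground-truth box returns one count per bin. A histogram that is heavily skewed toward the lowest bins is the kind of imbalance among positive RoIs that the abstract refers to.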
