Boosting Box-supervised Instance Segmentation with Pseudo Depth
Xinyi Yu, Ling Yan, Pengtao Jiang, Hao Chen, Bo Li, Lin Yuanbo Wu, Linlin Ou
TL;DR
The paper tackles weakly supervised instance segmentation under box annotations by using pseudo-depth maps as cues to separate foreground from background within each box. It builds a depth-guided mask head (DG-MaskHead) that jointly predicts depth and masks under a depth consistency loss, and adds a depth-aware cost to the Hungarian assignment that selects reliable pseudo masks during self-distillation. A teacher-student EMA framework then refines masks using these depth-informed pseudo labels, improving mask AP on COCO and Cityscapes and approaching fully supervised baselines with strong backbones such as Swin-Base. The results indicate that coarse depth information, when integrated into training, can substantially improve weakly supervised instance segmentation, with cross-domain transfer benefits.
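The teacher-student EMA idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter dictionaries, the function name `ema_update`, and the momentum value are assumptions chosen for illustration only.

```python
def ema_update(teacher, student, momentum=0.999):
    """Update teacher parameters in place: t <- m * t + (1 - m) * s.

    `teacher` and `student` are hypothetical dicts mapping parameter
    names to floats; a real model would hold tensors instead.
    """
    for name, s_val in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_val
    return teacher


# Toy usage: one scalar "parameter" per network.
teacher = {"w": 1.0}
student = {"w": 0.0}
ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly, which is the usual reason for taking pseudo labels from it rather than from the student.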
Abstract
Weakly Supervised Instance Segmentation (WSIS) under box supervision has attracted substantial attention and advanced considerably in recent years. However, box supervision provides no effective information for distinguishing foreground from background within the target box. This work addresses the problem by introducing pseudo-depth maps into the training of the instance segmentation network, improving its performance by capturing depth differences between instances. The pseudo-depth maps are generated with an off-the-shelf depth predictor and are not needed at inference. To enable the network to discern depth features when predicting masks, we integrate a depth prediction layer into the mask prediction head, so that the network predicts masks and depth simultaneously. We further use the masks generated during training as supervision to distinguish foreground from background. When selecting the best mask for each box with the Hungarian algorithm, we include depth consistency as one term of the matching cost. The proposed method achieves significant improvements on the Cityscapes and COCO datasets.
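The mask-to-box matching with a depth consistency cost term can be sketched as follows. The abstract does not give the exact cost, so everything here is an assumption: the cost is taken as a weighted sum of a box term (1 - IoU) and a depth term (mean absolute gap between predicted depth and pseudo depth inside the mask), and the names `match_masks_to_boxes`, `w_box`, and `w_depth` are hypothetical. For brevity, the sketch finds the minimum-cost assignment by exhaustive search rather than the Hungarian algorithm; both give the same optimum, but exhaustive search is only practical for a handful of candidates.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def depth_inconsistency(pred_depth, pseudo_depth, mask):
    """Mean |predicted - pseudo| depth over mask pixels (flattened lists)."""
    gaps = [abs(p - q) for p, q, m in zip(pred_depth, pseudo_depth, mask) if m]
    return sum(gaps) / len(gaps) if gaps else 1.0  # empty mask: worst case


def match_masks_to_boxes(candidates, gt_boxes, w_box=1.0, w_depth=1.0):
    """Return, for each ground-truth box, the index of its matched candidate."""
    if len(candidates) < len(gt_boxes):
        raise ValueError("need at least one candidate per ground-truth box")
    # cost[i][j]: cost of assigning candidate j to ground-truth box i.
    cost = [
        [
            w_box * (1.0 - box_iou(c["box"], g))
            + w_depth * depth_inconsistency(c["pred_depth"], c["pseudo_depth"], c["mask"])
            for c in candidates
        ]
        for g in gt_boxes
    ]
    # Exhaustive search over one-to-one assignments (stand-in for Hungarian).
    best = min(
        permutations(range(len(candidates)), len(gt_boxes)),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    return list(best)


# Toy usage: candidates 0 and 1 fit the first box equally well,
# but candidate 1 agrees better with the pseudo depth.
gt_boxes = [(0, 0, 2, 2), (4, 4, 6, 6)]
candidates = [
    {"box": (0, 0, 2, 2), "pred_depth": [0.9, 0.9], "pseudo_depth": [0.0, 0.0], "mask": [1, 1]},
    {"box": (0, 0, 2, 2), "pred_depth": [0.1, 0.1], "pseudo_depth": [0.0, 0.0], "mask": [1, 1]},
    {"box": (4, 4, 6, 6), "pred_depth": [0.5, 0.5], "pseudo_depth": [0.5, 0.5], "mask": [1, 1]},
]
assignment = match_masks_to_boxes(candidates, gt_boxes)  # [1, 2]
```

In the toy case the box term alone cannot separate candidates 0 and 1, so the depth term decides the match. That is the role the abstract attributes to depth consistency in the matching cost.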
