Boosting Box-supervised Instance Segmentation with Pseudo Depth

Xinyi Yu, Ling Yan, Pengtao Jiang, Hao Chen, Bo Li, Lin Yuanbo Wu, Linlin Ou

TL;DR

The paper tackles weakly supervised instance segmentation under box annotations by introducing pseudo-depth maps as depth cues to distinguish foreground from background within boxes. It builds a depth-guided mask head (DG-MaskHead) that jointly predicts depth and masks, guided by a depth consistency loss, and uses a depth-aware matching scheme within a Hungarian assignment to select reliable pseudo masks during self-distillation. A teacher-student EMA framework refines masks by leveraging depth-informed pseudo labels, leading to improved mask AP on COCO and Cityscapes, and achieving results close to fully supervised baselines on strong backbones like Swin-Base. This approach demonstrates that coarse depth information, when properly integrated into training, can significantly enhance weakly supervised instance segmentation with practical cross-domain transfer effects.
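One way to picture a depth consistency loss is as a pairwise term over neighbouring pixels: where the pseudo depth is nearly constant, the mask prediction should agree. The sketch below is an illustration only and is not taken from the paper. The function name `depth_consistency_loss`, the threshold `tau`, and the negative-log agreement form are all assumptions.

```python
import math


def depth_consistency_loss(mask_probs, depth, tau=0.1, eps=1e-6):
    """Hypothetical pairwise loss (not the paper's exact formulation).

    mask_probs: 2D list of foreground probabilities in [0, 1].
    depth:      2D list of pseudo-depth values, same shape.
    Neighbouring pixels whose depth differs by less than `tau` are
    encouraged to receive the same mask label.
    """
    h, w = len(mask_probs), len(mask_probs[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            # Compare each pixel with its right and bottom neighbours.
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny >= h or nx >= w:
                    continue
                if abs(depth[y][x] - depth[ny][nx]) >= tau:
                    continue  # depth edge: this pair is left unconstrained
                p, q = mask_probs[y][x], mask_probs[ny][nx]
                # Probability that both pixels receive the same label.
                same = p * q + (1.0 - p) * (1.0 - q)
                total += -math.log(max(same, eps))
                count += 1
    # Average over constrained pairs; zero if no pair was constrained.
    return total / count if count else 0.0
```

Under these assumptions the loss is zero when depth-coherent neighbours agree, and it grows when a mask boundary cuts through a region of uniform depth.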

Abstract

Weakly supervised instance segmentation (WSIS) under box supervision has attracted substantial attention and advanced considerably in recent years. However, box supervision provides no effective information for distinguishing foreground from background within the target box. This work addresses the problem by introducing pseudo-depth maps into the training of the instance segmentation network, improving performance by capturing depth differences between instances. The pseudo-depth maps are generated with an off-the-shelf depth predictor and are not needed at inference time. To let the network exploit depth features when predicting masks, we integrate a depth prediction layer into the mask prediction head, so that the network predicts masks and depth simultaneously. We further use the masks generated during training as supervision to distinguish foreground from background. When selecting the best mask for each box with the Hungarian algorithm, we use depth consistency as one term of the matching cost. The proposed method achieves significant improvements on the Cityscapes and COCO datasets.
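The role of depth consistency in the matching cost can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's definition: `depth_score` (one minus the mean absolute deviation of pseudo depth inside a mask, with depth normalised to [0, 1]) and the `weight` parameter are hypothetical, and the optimal assignment is found by brute-force enumeration rather than the Hungarian algorithm, which gives the same optimum on small inputs.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def depth_score(mask_depths):
    """Hypothetical score: higher when depth inside the mask is more uniform."""
    if not mask_depths:
        return 0.0
    mean = sum(mask_depths) / len(mask_depths)
    mad = sum(abs(d - mean) for d in mask_depths) / len(mask_depths)
    return 1.0 - mad


def match(gt_boxes, candidates, weight=1.0):
    """Assign one candidate mask to each ground-truth box.

    candidates: list of (mask_bounding_box, depth_values_inside_mask).
    Returns a list whose i-th entry is the candidate index for gt box i,
    or an empty list if there are fewer candidates than boxes.
    """
    cost = [
        [-(box_iou(g, c_box) + weight * depth_score(c_depths))
         for c_box, c_depths in candidates]
        for g in gt_boxes
    ]
    best, best_cost = None, float("inf")
    # Brute-force stand-in for the Hungarian algorithm (small inputs only).
    for perm in permutations(range(len(candidates)), len(gt_boxes)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best) if best is not None else []


gt = [(0, 0, 10, 10)]
cands = [
    ((0, 0, 10, 8), [0.1, 0.9, 0.1, 0.9]),  # higher IoU, mixed depth
    ((0, 0, 10, 7), [0.5, 0.5, 0.5, 0.5]),  # lower IoU, uniform depth
]
iou_only = match(gt, cands, weight=0.0)
depth_aware = match(gt, cands, weight=1.0)
```

In this toy case the IoU-only cost selects the first candidate, while adding the depth term selects the second, whose mask covers a region of uniform depth.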

Paper Structure (15 sections, 11 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Box-supervised instance segmentation. (a) Input image. (b) Box annotation. (c) Pseudo depth map generated with an off-the-shelf depth predictor (Ranftl et al., 2021). (d) Instance segmentation result of the proposed method.
  • Figure 2: The best matching masks. With the depth matching score, we can select better-fitting masks (last column) than with the IoU score alone (second column).
  • Figure 3: Depth-guided box-supervised instance segmentation. First, the network is trained with box annotations and pseudo-depth maps. During this process, a depth consistency loss encourages the network to produce consistent predictions for depth-coherent regions. In the last several training steps, we employ a self-distillation process, following Cheng et al. (2022) and Liu et al. (2021). We define a depth matching score in a depth-aware Hungarian algorithm to assign reliable masks for continued network training. In this framework, the teacher network is updated with an exponential moving average (EMA; Tarvainen and Valpola, 2017) and generates pseudo masks to realize the self-distillation process. DG-MaskHead refers to our depth-guided mask head module.
  • Figure 4: Depth-guided mask prediction head. This head contains a mask prediction head (MaskHead) and a depth estimation layer that predict mask and depth simultaneously; the depth features help the mask prediction head produce the same prediction across depth-consistent areas.
  • Figure 5: Visualization results on COCO-val (Lin et al., 2014). The top row shows outputs from our method, and the bottom row shows outputs from BoxInst (Tian et al., 2021). Our method improves performance in complex scenarios, such as occlusion, while effectively suppressing background regions that resemble the foreground.
  • ...and 1 more figures
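
The EMA teacher update mentioned in the Figure 3 caption follows the usual mean-teacher rule, teacher = m * teacher + (1 - m) * student. A minimal sketch, assuming parameters are stored as a plain dictionary of floats and using an illustrative momentum value that is not taken from the paper:

```python
def ema_update(teacher, student, momentum=0.999):
    """Update teacher parameters in place with an exponential moving average.

    teacher, student: dicts mapping parameter names to float values
    (a stand-in for real network weights). `momentum` is an assumed value.
    """
    for name, s_val in student.items():
        # Teacher moves a small step toward the current student weights.
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_val
    return teacher


teacher = {"w": 1.0}
student = {"w": 0.0}
ema_update(teacher, student, momentum=0.9)
```

Because the teacher changes slowly, the pseudo masks it generates are more stable across training steps than those of the student.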