Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning

Hoàng-Ân Lê, Paul Berg, Minh-Tan Pham

TL;DR

This paper proposes Box-for-Mask and Mask-for-Box strategies, and their combination BoMBo, to distil the necessary information from one task's annotations to train the other in multi-task partially supervised learning.

Abstract

Object detection and semantic segmentation are both scene understanding tasks, yet they differ in data structure and information level. Object detection requires box coordinates for object instances, while semantic segmentation requires pixel-wise class labels. Making use of one task's information to train the other would benefit multi-task partially supervised learning, where each training example is annotated for only a single task, and could expand training sets with datasets annotated for different tasks. This paper studies various weak losses for partially annotated data in combination with existing supervised losses. We propose Box-for-Mask and Mask-for-Box strategies, and their combination BoMBo, to distil the necessary information from one task's annotations to train the other. Ablation studies and experimental results on the VOC and COCO datasets show favorable results for the proposed idea. Source code and data splits can be found at https://github.com/lhoangan/multas.

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Multi-task partially supervised learning with two tasks, object detection (blue) and semantic segmentation (green). (a) Each image is labeled for a single task, indicated by the background color, and thus can only train the respective head. (b) The proposed Mask-for-Box and Box-for-Mask modules allow training one task head from the other's ground truths.
  • Figure 2: The Mask-for-Box module uses predicted boxes to refine the circumscribed rectangles of the masks' connected components, by separating multi-instance masks, merging sub-instance masks, or using the rectangles as ground truths. The good predicted boxes provide the instance cue, while the wrong ones are to be removed.
  • Figure 3: The Box-for-Mask module generates pseudo-masks by filling the ground truth boxes with their category and by applying an unsupervised method such as GrabCut. The box-shaped pseudo-masks are used to train the attention map $\alpha$, while the others are used for the predicted masks. The triplet loss constrains the embeddings to follow those with annotations.
  • Figure 4: Qualitative results of refined boxes: magenta indicates added boxes, yellow indicates merged boxes from ground truth masks, and blue indicates boxes from predictions.
  • Figure 5: The network architecture used in the paper, redrawn from Le2023BMVC with an additional attention module in the segmentation head. The figure is illustrated with a detection-annotated input, which therefore cannot train the semantic segmentation head.
  • ...and 3 more figures
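
The captions of Figures 2 and 3 describe two simple conversions between annotation types: filling ground truth boxes with their category to obtain box-shaped pseudo-masks, and taking the circumscribed rectangles of a mask's connected components as candidate boxes. The following is a minimal sketch of those two conversions only, not the authors' implementation. The function names, the `(x1, y1, x2, y2, class_id)` box layout with exclusive `x2`/`y2`, the use of 4-connectivity, and the choice to let later boxes overwrite earlier ones are all assumptions made for illustration. The GrabCut step, the refinement with predicted boxes, and the losses are omitted.

```python
from collections import deque


def boxes_to_pseudo_mask(boxes, height, width, background=0):
    """Box-shaped pseudo-mask: fill every box with its class id.

    Assumed box layout: (x1, y1, x2, y2, class_id), x2/y2 exclusive.
    Where boxes overlap, the later box overwrites the earlier one.
    """
    mask = [[background] * width for _ in range(height)]
    for x1, y1, x2, y2, class_id in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                mask[y][x] = class_id
    return mask


def mask_to_boxes(mask, background=0):
    """Circumscribed rectangle of each 4-connected, same-class component."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            class_id = mask[y0][x0]
            if class_id == background or seen[y0][x0]:
                continue
            # Breadth-first flood fill over pixels of the same class,
            # tracking the extent of the component as it grows.
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            x1, y1, x2, y2 = x0, y0, x0, y0
            while queue:
                y, x = queue.popleft()
                x1, y1 = min(x1, x), min(y1, y)
                x2, y2 = max(x2, x), max(y2, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and mask[ny][nx] == class_id):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x1, y1, x2 + 1, y2 + 1, class_id))
    return boxes


# Two touching instances of the same class collapse into one component.
touching = boxes_to_pseudo_mask([(0, 0, 2, 2, 1), (2, 0, 4, 2, 1)], 2, 4)
merged = mask_to_boxes(touching)
```

In this toy case `merged` holds a single rectangle covering both instances, because a semantic mask carries no instance boundaries. This is the multi-instance situation that, according to the Figure 2 caption, the Mask-for-Box module resolves with the instance cue from predicted boxes.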