Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations

Daan de Geus, Gijs Dubbelman

TL;DR

Task-Aligned Part-aware Panoptic Segmentation (TAPPS) uses a set of shared queries to jointly predict object-level segments and the part-level segments within those same objects. It thereby learns to predict part-level segments that are linked to individual parent objects, which aligns the learning objective with the task objective and allows TAPPS to leverage joint object-part representations.

Abstract

Part-aware panoptic segmentation (PPS) requires (a) that each foreground object and background region in an image is segmented and classified, and (b) that all parts within foreground objects are segmented, classified and linked to their parent object. Existing methods approach PPS by conducting object-level and part-level segmentation separately. However, their part-level predictions are not linked to individual parent objects. Their learning objective is therefore not aligned with the PPS task objective, which harms PPS performance. To solve this and make more accurate PPS predictions, we propose Task-Aligned Part-aware Panoptic Segmentation (TAPPS). This method uses a set of shared queries to jointly predict (a) object-level segments, and (b) the part-level segments within those same objects. As a result, TAPPS learns to predict part-level segments that are linked to individual parent objects, aligning the learning objective with the task objective and allowing TAPPS to leverage joint object-part representations. In experiments, we show that TAPPS considerably outperforms methods that predict objects and parts separately, and achieves new state-of-the-art PPS results.

Paper Structure (37 sections, 3 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Task-aligned part-aware panoptic segmentation. (a) Existing works separately predict object-level segments and object-instance-unaware part-level segments. (b) In this work, we predict objects and parts jointly, using a set of shared queries. This allows our method to predict parts within individual object segments, aligning its learning objective with the PPS task objective.
  • Figure 2: Network architecture. Left: The overall TAPPS architecture. A set of learnable queries and features from a backbone are fed into a pixel decoder and transformer decoder to generate high-resolution features and processed queries. Right: These queries and features are fed into the JOPS head, which predicts for each shared query (a) an object-level class, (b) an object-level segmentation mask, and (c) a set of part-level masks for the part-level classes compatible with the object-level class. Operator $\otimes$ denotes a matrix multiplication.
  • Figure 3: Baseline network architecture. Our strong baseline uses two separate sets of queries, one set for object-level segmentation and another set for part-level segmentation. Using these two sets of queries, this baseline network separately predicts object-level segments and object-unaware part-level segments. Operator $\otimes$ denotes a matrix multiplication.
  • Figure 4: Dynamic part segmentation. When conducting dynamic part segmentation, the JOPS head uses $N^{\textrm{dyn}}$ fully-connected (FC) layers to generate $N^{\textrm{dyn}}$ per-object part queries. Each per-object part query dynamically learns to represent at most one part-level segment within an object. For each per-object part query, we predict (a) a part-level class and (b) a part-level mask.
  • Figure 5: Qualitative examples of TAPPS and our strong baseline on Pascal-PP [chen2014pascalpart, everingham2010pascal, mottaghi14pascalcontext, degeus2021pps]. Both networks use ResNet-50 [he2016resnet] with COCO pre-training [lin2014coco]. White borders separate different object-level instances; color shades indicate different categories. Note that the colors of part-level categories are not identical across instances; there are different shades of the same color. In these examples, we can see how TAPPS improves both the instance separability and part segmentation quality with respect to the strong baseline. The red boxes indicate regions in which these differences are best visible. Best viewed digitally.
  • ...and 6 more figures
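
The JOPS head described in the Figure 2 caption can be illustrated with a rough NumPy sketch. It is only meant to show the data flow: each shared query yields an object-level class, an object-level mask, and one mask per part-level class, with masks obtained by multiplying query embeddings with the pixel features. Everything beyond that is an assumption made for illustration and not taken from the paper: the function name `jops_head_sketch`, the plain linear projections `w_cls`, `w_obj` and `w_part`, the boolean compatibility table `compat`, and the choice to set incompatible part masks to negative infinity.

```python
import numpy as np


def jops_head_sketch(queries, pixel_features, w_cls, w_obj, w_part, compat):
    """Illustrative joint object-part head (hypothetical, not the authors' code).

    queries:        (N, C)    processed shared queries
    pixel_features: (C, H, W) high-resolution features from the pixel decoder
    w_cls:          (C, K)    assumed linear object classifier
    w_obj:          (C, C)    assumed projection to object mask embeddings
    w_part:         (P, C, C) assumed projection per part-level class
    compat:         (K, P)    True where part class p can occur in object class k
    """
    n, c = queries.shape
    _, h, w = pixel_features.shape
    feats = pixel_features.reshape(c, h * w)

    # (a) Object-level class per shared query.
    class_logits = queries @ w_cls                      # (N, K)
    obj_class = class_logits.argmax(axis=1)             # (N,)

    # (b) Object-level mask: mask embedding multiplied with the pixel features.
    obj_masks = ((queries @ w_obj) @ feats).reshape(n, h, w)

    # (c) One part-level mask per part class, predicted from the same query.
    part_emb = np.einsum("nc,pcd->npd", queries, w_part)  # (N, P, C)
    part_masks = (part_emb @ feats).reshape(n, -1, h, w)   # (N, P, H, W)

    # Keep only the part classes compatible with the predicted object class.
    valid = compat[obj_class]                            # (N, P)
    part_masks = np.where(valid[:, :, None, None], part_masks, -np.inf)
    return obj_class, obj_masks, part_masks, valid


# Toy usage with random weights; all sizes are arbitrary.
rng = np.random.default_rng(0)
N, C, H, W, K, P = 4, 8, 5, 6, 3, 5
compat = np.zeros((K, P), dtype=bool)
compat[0, :2] = True   # object class 0 has part classes 0 and 1
compat[1, 2:] = True   # object class 1 has part classes 2, 3 and 4
                       # object class 2 has no parts
obj_class, obj_masks, part_masks, valid = jops_head_sketch(
    rng.normal(size=(N, C)),
    rng.normal(size=(C, H, W)),
    rng.normal(size=(C, K)),
    rng.normal(size=(C, C)),
    rng.normal(size=(P, C, C)),
    compat,
)
```

Because the part-level masks are indexed by the same query as the object-level mask, each part prediction is tied to one parent object by construction. This is the property the caption contrasts with the baseline in Figure 3, which uses a separate query set for parts.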