PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation

Xiangtai Li, Shilin Xu, Yibo Yang, Haobo Yuan, Guangliang Cheng, Yunhai Tong, Zhouchen Lin, Ming-Hsuan Yang, Dacheng Tao

TL;DR

Panoptic-PartFormer is the first end-to-end unified framework for panoptic part segmentation. It uses a meta-architecture that decouples part features from things/stuff features, and the paper proposes a new metric, Part-Whole Quality (PWQ), which better measures the task from pixel-region and part-whole perspectives.

Abstract

Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works use separate approaches to handle things, stuff, and part predictions, without shared computation or task association. We aim to unify these tasks at the architectural level by designing the first end-to-end unified framework, Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we first design a meta-architecture that decouples part features from things/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Second, we propose a new metric, Part-Whole Quality (PWQ), which better measures this task from pixel-region and part-whole perspectives. It also decouples the errors for part segmentation and panoptic segmentation. Third, inspired by Mask2Former and building on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross-attention scheme, a part-whole interaction method based on masked cross attention, to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results. Our models can serve as a strong baseline and aid future research in PPS. The source code and trained models will be available at https://github.com/lxtGH/Panoptic-PartFormer.
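
To make the "unified mask prediction and classification" formulation concrete, here is a minimal NumPy sketch of query-based prediction with decoupled features. It is an illustration under stated assumptions, not the authors' implementation: the function name predict_masks, the tensor sizes, the class counts, and the random weights are all hypothetical. It assumes each query yields a mask by a dot product with per-pixel features, that thing and stuff queries read from a scene feature map, and that part queries read from a separate part feature map.

```python
import numpy as np


def predict_masks(queries, features):
    """Mask logits from query embeddings and a per-pixel feature map.

    queries:  (N, C) array, one embedding per thing/stuff/part query
    features: (C, H, W) array of per-pixel features
    returns:  (N, H, W) array of mask logits
    """
    return np.einsum("nc,chw->nhw", queries, features)


rng = np.random.default_rng(0)
C, H, W = 8, 4, 6  # hypothetical channel count and feature-map size

# Decoupled features: one map for things/stuff, a separate map for parts.
scene_features = rng.standard_normal((C, H, W))
part_features = rng.standard_normal((C, H, W))

# Things, stuff, and parts are all represented as query embeddings.
thing_queries = rng.standard_normal((3, C))
stuff_queries = rng.standard_normal((2, C))
part_queries = rng.standard_normal((4, C))

# Thing and stuff queries share the scene features; part queries use part features.
scene_queries = np.concatenate([thing_queries, stuff_queries], axis=0)
scene_masks = predict_masks(scene_queries, scene_features)
part_masks = predict_masks(part_queries, part_features)

# Classification is a linear map applied to each query (weights are random here).
num_scene_classes, num_part_classes = 5, 7  # hypothetical class counts
scene_class_logits = scene_queries @ rng.standard_normal((C, num_scene_classes))
part_class_logits = part_queries @ rng.standard_normal((C, num_part_classes))
```

Every prediction type goes through the same two operations (a mask dot product and a per-query classifier), which is what lets a single set-prediction objective cover all three; only the feature map each query group reads from differs.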

Paper Structure (21 sections, 9 equations, 17 figures, 16 tables)

Figures (17)

  • Figure 1: Illustration of the Panoptic Part Segmentation task. It combines Panoptic Segmentation and Part Segmentation in a unified manner that provides a multi-level concept understanding of the image. Best viewed in color.
  • Figure 2: (a) The baseline method proposed in [degeus2021panopticparts] combines the results of panoptic segmentation and part segmentation. (b) The Panoptic-FPN-like baseline [kirillov2019panopticfpn, li2019attention, xiong2019upsnet] adds part segmentation to current panoptic segmentation frameworks. (c) Our proposed approach represents things, stuff, and parts via object queries and performs joint learning in a unified manner.
  • Figure 3: Our proposed Panoptic-PartFormer. It contains three parts: (a) a backbone to extract features (red area); (b) a decoupled decoder that generates scene features and part features, along with initial prediction heads that generate the initial mask predictions (yellow area); and (c) a cascaded transformer decoder that performs reasoning between the queries and the query features (green area). Green arrows denote inputs (from the previous stage), while red arrows denote the current stage's outputs (used by the next stage); the outputs on the red arrows are the inputs on the green arrows. We take the last-stage outputs as the final output [detr, cheng2021mask2former].
  • Figure 4: Meta-architecture of Panoptic-PartFormer. (a) is the default general Mask Transformer design for PPS. (b) is our Panoptic-PartFormer, which decouples the scene and part heads both for learning part features and in the cross-reasoning stages.
  • Figure 5: Visual examples for PartPQ and PWQ. Left: merged baseline. Right: our PanopticPartFormer++. Both methods use Swin-base as the backbone. Although both methods achieve similar PartPQ results, their part segmentation quality is significantly different. Our proposed PWQ better reflects the difference between the two methods in part details.
  • ...and 12 more figures
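
The stage-wise refinement in the Figure 3 caption and the masked cross attention named in the abstract can be sketched as below. This is the generic masked-attention idea associated with Mask2Former, written in NumPy under simplifying assumptions: no learned query/key/value projections, a single head, and a foreground threshold of 0 on the previous stage's mask logits. The function name and the toy inputs are hypothetical, and the paper's specific part-whole interaction is not reproduced here.

```python
import numpy as np


def masked_cross_attention(queries, pixel_feats, prev_mask_logits):
    """One masked cross-attention step from queries to pixels.

    queries:          (N, C) query embeddings
    pixel_feats:      (P, C) flattened per-pixel features
    prev_mask_logits: (N, P) mask logits predicted by the previous stage
    returns:          (updated queries of shape (N, C), attention weights (N, P))
    """
    channels = queries.shape[1]
    scores = queries @ pixel_feats.T / np.sqrt(channels)

    # Each query attends only to pixels its previous mask marked as foreground.
    keep = prev_mask_logits > 0.0
    # A query whose previous mask is empty attends everywhere, so no row is all -inf.
    keep[~keep.any(axis=1)] = True

    scores = np.where(keep, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ pixel_feats, weights


rng = np.random.default_rng(0)
queries = rng.standard_normal((2, 4))
pixel_feats = rng.standard_normal((6, 4))
prev_mask_logits = np.array([
    [2.0, 1.0, -1.0, -2.0, -3.0, -1.0],    # foreground on the first two pixels
    [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0],  # empty mask: falls back to all pixels
])
updated, weights = masked_cross_attention(queries, pixel_feats, prev_mask_logits)
```

In a cascaded decoder, the updated queries would produce new mask logits, which then serve as prev_mask_logits for the next stage.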