Depth-aware Panoptic Segmentation

Tuan Nguyen, Max Mehltretter, Franz Rottensteiner

TL;DR

This work addresses panoptic segmentation by incorporating 3D geometry through RGB-depth data in a late-fusion CNN framework. Building on Panoptic FCN, it adds a depth encoder and a depth-aware Dice loss to better separate visually similar object instances, particularly among thing classes. On Cityscapes, the method achieves a +2.2 percentage point improvement in panoptic quality, with larger gains for thing classes and a reduction in merged instances, illustrating the value of explicit depth information. The approach highlights practical benefits and suggests future directions such as incorporating 3D spatial distances and temporal sequences to further enhance segmentation robustness.

Abstract

Panoptic segmentation unifies semantic and instance segmentation and thus delivers a semantic class label and, for so-called thing classes, also an instance label per pixel. Differentiating distinct objects of the same class with a similar appearance is particularly challenging and frequently causes such objects to be incorrectly assigned to a single instance. In the present work, we demonstrate that information on the 3D geometry of the observed scene can be used to mitigate this issue: we present a novel CNN-based method for panoptic segmentation which processes RGB images and depth maps in separate network branches and fuses the resulting feature maps in a late-fusion manner. Moreover, we propose a new depth-aware Dice loss term which penalises the assignment of pixels to the same thing instance based on the difference between their associated distances to the camera. Experiments carried out on the Cityscapes dataset show that the proposed method reduces the number of objects that are erroneously merged into one thing instance and outperforms the baseline method by 2.2 percentage points in terms of panoptic quality.
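
For readers unfamiliar with the metric behind the reported gain, the following is a minimal sketch of how panoptic quality (PQ) is commonly computed for one class: the sum of the IoUs of matched segment pairs, divided by the number of true positives plus half the false positives and half the false negatives. The function name and the sample numbers are illustrative assumptions, not values from the paper.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Panoptic quality for one class.

    matched_ious: IoU of each matched (predicted, ground-truth) segment pair;
                  a pair counts as a match (true positive) only if IoU > 0.5.
    num_fp: predicted segments without a matching ground-truth segment.
    num_fn: ground-truth segments without a matching prediction.
    """
    if any(iou <= 0.5 for iou in matched_ious):
        raise ValueError("matched pairs must have IoU > 0.5")
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0  # no segments of this class at all
    return sum(matched_ious) / denom


# Hypothetical numbers: three matched segments, one spurious, one missed.
pq = panoptic_quality([0.8, 0.9, 0.7], num_fp=1, num_fn=1)
print(f"PQ = {pq:.3f}")  # roughly 0.6
```

Two instances erroneously merged into one typically show up in this measure as an unmatched prediction (or a low-IoU match) together with at least one missed ground-truth segment, which is why fewer merges tend to raise PQ for thing classes.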

Paper Structure (17 sections, 9 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Left: A binary instance mask predicted by Panoptic FCN (Li et al., 2021) that erroneously merges two car instances, superimposed on the input image. Right: We exploit the depth difference $\bar{d_j}$ between pixels corresponding to different instances (the triangle and the circle) in training to mitigate the problem.
  • Figure 2: Our proposed method. The blocks with a red edging are our proposed modules. The remaining ones are also used in Panoptic FCN (Li et al., 2021), but there the output of the colour encoder is directly processed by the feature encoder and kernel generator blocks. Our method additionally uses an encoder for the depth map and a fusion module; the subsequent blocks process the results of colour and depth fusion. $\otimes$ indicates a convolution. In training, we use a new depth-aware Dice loss for the thing instances.
  • Figure 3: Visualisation of the Dice function (left) and the new depth-aware variant (right). In the latter case, taking depth into account increases the penalty for false positives (FPs) whose depth differs strongly from that of the true positives (TPs).
  • Figure 4: Qualitative examples of results achieved on the Cityscapes data. Top: results of the Panoptic FCN baseline (Li et al., 2021); bottom: results of our method. Different colours identify the stuff class or thing instance to which a pixel is assigned, and the resulting label maps are superimposed on the RGB input images. The red boxes highlight examples in which the baseline erroneously merged two thing instances, whereas our method separated them correctly.
  • Figure 5: Failure cases of our method: the red boxes indicate instances erroneously merged by our method. The merged instances occur at a similar depth.
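
The idea described in the captions of Figures 1 and 3 can be illustrated with a small sketch. The exact loss formulation is not given on this page, so everything below is an assumption made for illustration: a common soft Dice form written in terms of soft TP, FP and FN sums, with the FP term of each pixel scaled by a hypothetical weight that grows with the pixel's depth difference to the mean depth of the ground-truth instance. The function names, the weighting scheme and the normalisation are not taken from the paper.

```python
def soft_dice_loss(pred, target, fp_weights=None, eps=1e-6):
    """Soft Dice loss for one instance mask, given flat per-pixel lists.

    pred: predicted probabilities in [0, 1]; target: ground truth (0 or 1).
    fp_weights: optional per-pixel factors applied to the false-positive term.
    """
    if fp_weights is None:
        fp_weights = [1.0] * len(pred)
    tp = sum(p * g for p, g in zip(pred, target))
    fp = sum(w * p * (1 - g) for w, p, g in zip(fp_weights, pred, target))
    fn = sum((1 - p) * g for p, g in zip(pred, target))
    return 1.0 - 2.0 * tp / (2.0 * tp + fp + fn + eps)


def depth_aware_dice_loss(pred, target, depth, eps=1e-6):
    """Assumed depth-aware variant: FPs far from the instance in depth cost more."""
    gt_depths = [d for d, g in zip(depth, target) if g > 0]
    if not gt_depths:
        # No ground-truth pixels, so no reference depth: fall back to plain Dice.
        return soft_dice_loss(pred, target, eps=eps)
    ref = sum(gt_depths) / len(gt_depths)  # mean depth of the instance
    max_diff = max(abs(d - ref) for d in depth)
    scale = max_diff if max_diff > 0 else 1.0
    # Hypothetical weighting: 1 at the instance depth, 2 at the largest difference.
    weights = [1.0 + abs(d - ref) / scale for d in depth]
    return soft_dice_loss(pred, target, fp_weights=weights, eps=eps)


# Toy example: pixel 2 is a false positive, once far away and once at instance depth.
pred, target = [1.0, 1.0, 1.0, 0.0], [1, 1, 0, 0]
print(soft_dice_loss(pred, target))
print(depth_aware_dice_loss(pred, target, depth=[10.0, 10.0, 50.0, 50.0]))
print(depth_aware_dice_loss(pred, target, depth=[10.0, 10.0, 10.0, 50.0]))
```

In this toy setting, the false positive at a very different depth yields a larger loss than the plain Dice loss, while a false positive at the instance depth yields the same loss. The latter is consistent with the failure cases in Figure 5, where the merged instances lie at a similar depth and a depth-based penalty provides little additional signal.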