PanDepth: Joint Panoptic Segmentation and Depth Completion

Juan Lagos, Esa Rahtu

TL;DR

PanDepth addresses the need for holistic 3D scene understanding in autonomous driving by jointly performing panoptic segmentation and depth completion from RGB images and sparse depth maps. The authors propose an end-to-end architecture with a two-way FPN backbone (EfficientNet-B5), three task-specific branches (semantic segmentation, instance segmentation, and depth completion), a joint refinement branch, and a panoptic fusion module, trained with a combined loss. On Virtual KITTI 2, PanDepth produces dense depth maps and panoptic segmentation with a modest parameter count and competitive accuracy across tasks, outperforming some baselines on semantic segmentation and depth metrics. The work also provides generated panoptic annotations for Virtual KITTI 2 and demonstrates that joint learning is practically viable for integrated 3D scene understanding in driving scenarios.

Abstract

Understanding 3D environments semantically is pivotal in autonomous driving applications where multiple computer vision tasks are involved. Multi-task models provide different types of outputs for a given scene, yielding a more holistic representation while keeping the computational cost low. We propose a multi-task model for panoptic segmentation and depth completion using RGB images and sparse depth maps. Our model predicts fully dense depth maps and performs semantic segmentation, instance segmentation, and panoptic segmentation for every input frame. We conducted extensive experiments on the Virtual KITTI 2 dataset and demonstrate that our model solves multiple tasks without a significant increase in computational cost while maintaining high accuracy. Code is available at https://github.com/juanb09111/PanDepth.git
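
The abstract describes two per-frame outputs, a dense depth map and a panoptic segmentation, and the Figure 1 caption says these are combined into a 3D panoptic representation. The text here does not say how that combination is done. The sketch below shows one common way, assuming a pinhole camera model: each pixel with a valid depth is back-projected to a 3D point and tagged with its panoptic label. The function name `backproject_panoptic`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy values are illustrative assumptions, not taken from the paper or its code.

```python
from typing import List, Tuple

Point = Tuple[float, float, float, int]  # (X, Y, Z, panoptic_id)


def backproject_panoptic(
    depth: List[List[float]],
    panoptic: List[List[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift each pixel with positive depth to a labeled 3D point.

    ASSUMPTION: a pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy); the paper's own procedure may differ.
    """
    points: List[Point] = []
    for v, (depth_row, label_row) in enumerate(zip(depth, panoptic)):
        for u, (z, label) in enumerate(zip(depth_row, label_row)):
            if z <= 0.0:  # skip pixels without a valid depth value
                continue
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            points.append((x, y, z, label))
    return points


# Toy 2x2 frame; depth values, labels, and intrinsics are made up.
depth = [[2.0, 4.0], [0.0, 8.0]]
panoptic = [[1, 1], [0, 2]]
cloud = backproject_panoptic(depth, panoptic, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Each resulting tuple carries a position and a panoptic identifier, so points can be grouped or colored by segment. In practice the intrinsics would come from the dataset's camera calibration rather than hand-picked constants.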

Paper Structure (26 sections, 5 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The proposed model (PanDepth) takes RGB images and sparse depth and returns the corresponding panoptic segmentation and fully dense depth map with which we create a 3D panoptic segmentation representation of the input frame.
  • Figure 2: Overview of the proposed PanDepth architecture. Given an RGB image and sparse depth map as input, our model outputs the corresponding dense depth map and panoptic segmentation.
  • Figure 3: PanDepth architecture. Our model consists of a feature extractor, three task-specific branches (i.e. instance segmentation, semantic segmentation, and depth completion), a joint branch, and a panoptic fusion module. The convolutional layers in this diagram follow the notation Conv($k$,$s$,$c$) $\times n$ representing a stack of $n$ convolutional layers where $k$ refers to a kernel of size $k \times k$, $s$ is the stride, $c$ is the number of output feature channels, and $FC$ represents a fully connected layer.
  • Figure 4: Depth map visualization at different sparsity levels.
  • Figure 5: Panoptic segmentation and depth completion results on Virtual KITTI 2. Rows from top down show: (a) RGB input images, (b) semantic segmentation, (c) instance segmentation, (d) panoptic segmentation, (e) depth completion output, and (f) 3D panoptic segmentation.
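
The Conv($k$,$s$,$c$) $\times n$ notation in the Figure 3 caption can be read as a small layer specification. The sketch below expands that notation into a list of layers and derives the output size and parameter count of a stack. The padding rule (k // 2), the use of bias terms, and the example values are assumptions made for illustration; the caption does not specify them, and the names `ConvSpec`, `expand_stack`, `output_size`, and `parameter_count` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConvSpec:
    kernel: int        # k: the kernel is k x k
    stride: int        # s
    out_channels: int  # c


def expand_stack(k: int, s: int, c: int, n: int) -> List[ConvSpec]:
    """Expand Conv(k, s, c) x n into n identical layer specifications."""
    return [ConvSpec(k, s, c) for _ in range(n)]


def output_size(size: int, stack: List[ConvSpec]) -> int:
    """Spatial size after the stack, ASSUMING padding = k // 2 per layer."""
    for layer in stack:
        pad = layer.kernel // 2
        size = (size + 2 * pad - layer.kernel) // layer.stride + 1
    return size


def parameter_count(in_channels: int, stack: List[ConvSpec]) -> int:
    """Weights plus biases, ASSUMING plain (non-grouped) convolutions with bias."""
    total = 0
    for layer in stack:
        total += layer.kernel * layer.kernel * in_channels * layer.out_channels
        total += layer.out_channels  # one bias per output channel
        in_channels = layer.out_channels  # the next layer consumes c channels
    return total


# Hypothetical example: Conv(3, 1, 128) x 2 applied to a 256-channel input.
stack = expand_stack(k=3, s=1, c=128, n=2)
```

With these assumptions, a stride of 1 keeps the spatial size unchanged and a stride of 2 halves it, which is a quick way to check how a branch in the diagram changes feature-map resolution.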