Panoptic-Depth Forecasting

Juana Valeria Hurtado, Riya Mohan, Abhinav Valada

TL;DR

This work introduces the panoptic-depth forecasting task, which jointly predicts the panoptic segmentation and depth maps of unobserved future frames from monocular camera images, and proposes the novel PDcast architecture, which learns rich spatio-temporal representations for this task.

Abstract

Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames, from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at https://pdcast.cs.uni-freiburg.de.
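
The evaluation idea in the abstract builds on two familiar ingredients: panoptic quality and depth error. As a minimal sketch, and not the combined metric introduced in the paper, the following Python computes the standard panoptic quality (PQ) from already-matched segments, along with two common depth errors (RMSE and absolute relative error). The function names are hypothetical, and the convention that a ground-truth depth of 0 marks a pixel without a LiDAR return is an assumption made for illustration.

```python
import math


def panoptic_quality(matched_ious, num_fp, num_fn):
    """Standard PQ: sum of IoUs over true-positive matches,
    divided by (TP + 0.5 * FP + 0.5 * FN)."""
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom > 0 else 0.0


def _valid_pairs(pred, gt):
    # ASSUMPTION: gt == 0 marks pixels with no LiDAR depth; they are skipped.
    return [(p, g) for p, g in zip(pred, gt) if g > 0]


def depth_rmse(pred, gt):
    """Root-mean-square error over pixels with valid ground truth."""
    pairs = _valid_pairs(pred, gt)
    if not pairs:
        return 0.0
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


def depth_abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid ground truth."""
    pairs = _valid_pairs(pred, gt)
    if not pairs:
        return 0.0
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Toy forecast: two matched segments, one spurious and one missed segment.
pq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
# Toy depth values in metres, flattened; the last pixel has no ground truth.
rmse = depth_rmse([10.0, 20.0, 5.0], [12.0, 20.0, 0.0])
abs_rel = depth_abs_rel([10.0, 20.0, 5.0], [12.0, 20.0, 0.0])
```

Masking out pixels without ground truth matters here because depth maps projected from LiDAR point clouds are sparse. How the paper fuses the panoptic and depth terms into a single score is not shown in this sketch.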

Paper Structure (18 sections, 3 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Panoptic-depth forecasting learns rich spatio-temporal representations to jointly predict the pixel-level semantic category, instance ID, and depth value of unobserved future frames.
  • Figure 2: Overview of our proposed PDcast architecture for panoptic-depth forecasting. A single transformer-based encoder extracts rich spatio-temporal features from past monocular camera images. The forecasting module then learns to forecast features into the future, which serve as the input to the two decoders for panoptic segmentation and depth estimation.
  • Figure 3: Architecture of the (a) spatial module, (b) temporal module, and (c) forecasting module. The spatial module processes each time frame separately. The spatial and forecasting modules show the process for each scale $s \in \{2, 4, 8, 16\}$.
  • Figure 4: Qualitative comparison of predictions from our proposed PDcast model with the second-best baseline CoDEPS(+) on the KITTI-360 dataset. We show the camera image corresponding to the future frame at $t + \Delta$ and the panoptic-depth ground truth (GT). We observe that our model accurately forecasts panoptic-depth predictions, capturing scene details such as poles, even when a car is exiting the frame.
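
The data flow described in the captions of Figures 2 and 3 can be sketched as plain function composition. This is an illustrative outline only, not the PDcast implementation: the four callables are hypothetical placeholders for the encoder, forecasting module, and the two decoders, and reading each scale $s$ as a downsampling factor (feature maps of size $H/s \times W/s$) is an assumption.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

SCALES = (2, 4, 8, 16)  # scales listed in the Figure 3 caption


def feature_shapes(
    height: int, width: int, scales: Sequence[int] = SCALES
) -> Dict[int, Tuple[int, int]]:
    # ASSUMPTION: scale s denotes an s-times downsampled feature map.
    return {s: (height // s, width // s) for s in scales}


def forecast_panoptic_depth(
    past_frames: Sequence[Any],
    encoder: Callable[[Sequence[Any]], Any],
    forecaster: Callable[[Any], Any],
    panoptic_decoder: Callable[[Any], Any],
    depth_decoder: Callable[[Any], Any],
) -> Tuple[Any, Any]:
    """Shared encoder -> forecasting module -> two task-specific decoders."""
    features = encoder(past_frames)          # spatio-temporal features of past frames
    future_features = forecaster(features)   # features forecast to the future frame
    # Both decoders read the same forecast features.
    return panoptic_decoder(future_features), depth_decoder(future_features)


# Usage with stand-in callables (hypothetical, for shape and flow checks only).
shapes = feature_shapes(256, 512)
panoptic, depth = forecast_panoptic_depth(
    past_frames=["frame_t-2", "frame_t-1", "frame_t"],
    encoder=lambda frames: len(frames),
    forecaster=lambda feats: feats + 1,
    panoptic_decoder=lambda feats: ("panoptic", feats),
    depth_decoder=lambda feats: ("depth", feats),
)
```

Passing the modules in as callables keeps the sketch independent of any deep learning framework; in practice each placeholder would be a learned network.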