MGNiceNet: Unified Monocular Geometric Scene Understanding

Markus Schön, Michael Buchholz, Klaus Dietmayer

TL;DR

MGNiceNet addresses the need for real-time monocular geometric scene understanding by unifying panoptic segmentation and self-supervised depth estimation. It extends the real-time RT-K-Net with four linked kernel update heads and a lightweight depth predictor that operates at the panoptic mask level, enabling explicit cross-task coupling. A panoptic-guided motion masking strategy mitigates dynamic-object interference during self-supervised training, improving depth accuracy without requiring video panoptic annotations. Through extensive experiments on Cityscapes and KITTI, MGNiceNet achieves state-of-the-art real-time panoptic performance and competitive depth accuracy, while maintaining fast inference suitable for autonomous driving systems.

Abstract

Monocular geometric scene understanding combines panoptic segmentation and self-supervised depth estimation, focusing on real-time application in autonomous vehicles. We introduce MGNiceNet, a unified approach that uses a linked kernel formulation for panoptic segmentation and self-supervised depth estimation. MGNiceNet is based on the state-of-the-art real-time panoptic segmentation method RT-K-Net and extends the architecture to cover both panoptic segmentation and self-supervised monocular depth estimation. To this end, we introduce a tightly coupled self-supervised depth estimation predictor that explicitly uses information from the panoptic path for depth prediction. Furthermore, we introduce a panoptic-guided motion masking method to improve depth estimation without relying on video panoptic segmentation annotations. We evaluate our method on two popular autonomous driving datasets, Cityscapes and KITTI. Our model shows state-of-the-art results compared to other real-time methods and closes the gap to computationally more demanding methods. Source code and trained models are available at https://github.com/markusschoen/MGNiceNet.

Paper Structure

This paper contains 29 sections, 13 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of our MGNiceNet architecture. Images are fed into the Feature Encoder (\ref{sec:feat_enc}) to produce high-resolution feature maps $\mathbf{F}^{\mathrm{p}}$ and $\mathbf{F}^{\mathrm{d}}$. Next, the panoptic kernels $\mathbf{K}^{\mathrm{p}}_0$, the self-supervised depth kernels $\mathbf{K}^{\mathrm{d}}_0$, and the panoptic masks $\mathbf{M}^{\mathrm{p}}_0$ are initialized (\ref{sec:mask_init}). Kernels and masks are updated iteratively in the four linked kernel update heads (\ref{sec:linked_kernels}). Each head contains four stages and a self-supervised depth predictor (\ref{sec:depth_pred}), which converts depth mask predictions $\mathbf{M}^{\mathrm{d}}_i$ and depth group features $\mathbf{X}^{\mathrm{d}}_i$ into an inverse depth prediction $\mathbf{\hat{D}}^{\mathrm{d}}_i$. A post-processing stage (\ref{sec:post_proc}) converts the mask predictions $\mathbf{M}^{\mathrm{p}}_4$, class probability logits $\mathbf{p}_4$, and inverse depth predictions $\mathbf{\hat{D}}^{\mathrm{d}}_4$ into the final panoptic prediction $\mathbf{P}$ and depth prediction $\mathbf{D}$. To improve optimization (\ref{sec:optimization}), we introduce a panoptic-guided motion masking method (\ref{sec:motion_mask}), which computes a motion mask that excludes dynamic objects from the photometric loss.
  • Figure 2: Visualization of our self-supervised depth predictor module in (a) and our panoptic-guided motion masking in (b).
  • Figure 3: Effect of our panoptic-guided motion masking. Without it, the model predicts holes of infinite depth on pixels belonging to a car that drives at a velocity similar to that of the ego vehicle. This effect is reduced significantly when our panoptic-guided motion masking method is used during training.
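
The captions above say that the motion mask removes dynamic objects from the photometric loss. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes a plain L1 photometric error on grayscale images, and it assumes that the set of dynamic segment ids (`dynamic_ids`) is already known; how MGNiceNet actually decides which panoptic segments are moving is not described in this section. The function names `motion_mask` and `masked_photometric_loss` are hypothetical.

```python
from typing import List, Set

Grid = List[List[float]]


def motion_mask(panoptic_ids: List[List[int]], dynamic_ids: Set[int]) -> Grid:
    """Return 1.0 for pixels to keep and 0.0 for pixels of dynamic segments.

    Assumption: `dynamic_ids` is supplied by some upstream step that is not
    covered here.
    """
    return [
        [0.0 if pid in dynamic_ids else 1.0 for pid in row]
        for row in panoptic_ids
    ]


def masked_photometric_loss(target: Grid, warped: Grid, mask: Grid) -> float:
    """Mean absolute error between target and warped image over kept pixels.

    Assumption: L1 error only; self-supervised depth methods often combine
    L1 with a structural similarity term, which is omitted for brevity.
    """
    total = 0.0
    kept = 0.0
    for t_row, w_row, m_row in zip(target, warped, mask):
        for t, w, m in zip(t_row, w_row, m_row):
            total += m * abs(t - w)
            kept += m
    # If every pixel is masked out, there is nothing to supervise.
    return total / kept if kept > 0 else 0.0


# Toy 2x2 example: segment 7 stands in for a moving car.
target = [[0.2, 0.4], [0.6, 0.8]]
warped = [[0.2, 0.9], [0.5, 0.8]]
panoptic_ids = [[1, 7], [1, 1]]

mask = motion_mask(panoptic_ids, dynamic_ids={7})
print("masked loss:  ", masked_photometric_loss(target, warped, mask))
print("unmasked loss:", masked_photometric_loss(target, warped, [[1.0, 1.0], [1.0, 1.0]]))
```

In this toy case, the pixel with the largest photometric error belongs to the dynamic segment, so masking it lowers the loss. This mirrors the motivation given for Figure 3: pixels on objects that move with the ego vehicle violate the static-scene assumption behind the photometric loss and would otherwise push the depth prediction toward infinity.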