MGNiceNet: Unified Monocular Geometric Scene Understanding
Markus Schön, Michael Buchholz, Klaus Dietmayer
TL;DR
MGNiceNet addresses the need for real-time monocular geometric scene understanding by unifying panoptic segmentation and self-supervised depth estimation. It extends the real-time RT-K-Net with four linked kernel update heads and a lightweight depth predictor that operates at the panoptic mask level, enabling explicit cross-task coupling. A panoptic-guided motion masking strategy mitigates dynamic-object interference during self-supervised training, improving depth accuracy without requiring video panoptic annotations. Through extensive experiments on Cityscapes and KITTI, MGNiceNet achieves state-of-the-art real-time panoptic performance and competitive depth accuracy, while maintaining fast inference suitable for autonomous driving systems.
Abstract
Monocular geometric scene understanding combines panoptic segmentation and self-supervised depth estimation, with a focus on real-time application in autonomous vehicles. We introduce MGNiceNet, a unified approach that uses a linked kernel formulation for both tasks. MGNiceNet builds on RT-K-Net, a state-of-the-art real-time panoptic segmentation method, and extends its architecture to cover both panoptic segmentation and self-supervised monocular depth estimation. To this end, we propose a tightly coupled self-supervised depth predictor that explicitly uses information from the panoptic path for depth prediction. Furthermore, we present a panoptic-guided motion masking method that improves depth estimation without relying on video panoptic segmentation annotations. We evaluate our method on two popular autonomous driving datasets, Cityscapes and KITTI. Our model achieves state-of-the-art results compared to other real-time methods and closes the gap to computationally more demanding methods. Source code and trained models are available at https://github.com/markusschoen/MGNiceNet.
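
The abstract does not spell out how panoptic-guided motion masking works. As a rough illustration only, one plausible reading is that pixels predicted to lie on potentially dynamic classes are excluded from the photometric loss during self-supervised training. The sketch below assumes exactly that. The function name `masked_photometric_loss`, the class id `1`, and the toy values are hypothetical and are not taken from the MGNiceNet code base.

```python
from typing import List, Set


def masked_photometric_loss(
    photometric_error: List[List[float]],
    semantic_labels: List[List[int]],
    dynamic_class_ids: Set[int],
) -> float:
    """Mean photometric error over pixels NOT labelled as a dynamic class.

    Assumption: the mask is derived from per-pixel class predictions of the
    panoptic path; the actual method may build its mask differently.
    """
    total = 0.0
    count = 0
    for err_row, label_row in zip(photometric_error, semantic_labels):
        for err, label in zip(err_row, label_row):
            if label in dynamic_class_ids:
                continue  # skip pixels on potentially moving objects
            total += err
            count += 1
    # Return 0.0 when every pixel is masked so the loss stays finite.
    return total / count if count else 0.0


# Toy 2x3 image: class 1 stands in for a dynamic class such as "car"
# (hypothetical id), class 0 for static background.
error = [[0.1, 0.9, 0.2], [0.3, 0.8, 0.4]]
labels = [[0, 1, 0], [0, 1, 0]]
loss = masked_photometric_loss(error, labels, {1})
```

In this toy case the two high-error pixels on the dynamic class are ignored, so only the four static pixels contribute to the mean. Because the mask comes from single-frame predictions, a scheme of this kind would not need panoptic labels that are consistent across video frames, which matches the motivation stated in the abstract.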
