Multimodal Foundational Models for Unsupervised 3D General Obstacle Detection
Tamás Matuszka, Péter Hajas, Dávid Szeghy
TL;DR
The paper addresses detecting general road obstacles beyond predefined categories in autonomous driving. It introduces a training-free, offline hybrid approach that combines multimodal foundational models for image-space obstacle segmentation with unsupervised computational-geometry-based 3D localization. Key contributions include integrating Grounding DINO and Segment Anything for general obstacle cueing; a four-step unsupervised 3D detector with ground-plane removal, clustering, and tracking; and a new annotated obstacle dataset. Results show obstacle detection up to 100 meters with localization errors around 0.65 m longitudinal and 0.57 m lateral. The results also expose limitations for non-reflective distant objects and for inaccurate road masks, which point to future improvements such as leveraging ground-plane depth and correcting road masks.
Abstract
Current autonomous driving perception models primarily rely on supervised learning with predefined categories. However, these models struggle to detect general obstacles outside the fixed category set, because such obstacles vary widely and give rise to numerous edge cases. To address this issue, we propose combining multimodal foundational model-based obstacle segmentation with traditional unsupervised, computational geometry-based outlier detection. Our approach operates offline, which allows us to leverage non-causality, and uses only training-free methods. This enables the detection of general obstacles in 3D without the need for expensive retraining. To overcome the limitations of publicly available obstacle detection datasets, we collected and annotated our own dataset, which includes a variety of obstacles, including ones in distant regions.
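To make the idea of geometry-based outlier detection concrete, the following is a minimal sketch of ground-plane removal on a 3D point cloud. It is not the paper's implementation: the use of RANSAC, the function names `fit_ground_plane` and `obstacle_candidates`, and the 0.1 m distance threshold are all assumptions chosen for illustration. Points that lie far from the fitted plane are treated as obstacle candidates.

```python
import numpy as np


def fit_ground_plane(points, n_iters=200, threshold=0.1, seed=0):
    """Fit a plane n.p + d = 0 with RANSAC (an assumed technique, not the paper's).

    points: (N, 3) array of x, y, z coordinates in meters.
    Returns (unit_normal, d) for the plane with the most inliers.
    """
    rng = np.random.default_rng(seed)
    best_count, best_plane = 0, None
    for _ in range(n_iters):
        # Three random points define a candidate plane.
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # skip degenerate (collinear) samples
        normal = normal / norm
        d = -normal.dot(sample[0])
        # Count points within `threshold` meters of the candidate plane.
        count = int((np.abs(points @ normal + d) < threshold).sum())
        if count > best_count:
            best_count, best_plane = count, (normal, d)
    if best_plane is None:
        raise ValueError("could not fit a plane to the given points")
    return best_plane


def obstacle_candidates(points, threshold=0.1, **ransac_kwargs):
    """Return the points that are NOT explained by the fitted ground plane."""
    normal, d = fit_ground_plane(points, threshold=threshold, **ransac_kwargs)
    distance = np.abs(points @ normal + d)
    return points[distance >= threshold]
```

In a full pipeline of the kind described above, the remaining points would still need to be clustered, tracked, and matched against the image-space obstacle masks; this sketch covers only the plane-removal step.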
