Multimodal Foundational Models for Unsupervised 3D General Obstacle Detection

Tamás Matuszka, Péter Hajas, Dávid Szeghy

TL;DR

The paper addresses the detection of general road obstacles that fall outside the predefined categories used in autonomous driving. It introduces a training-free, offline hybrid approach that combines multimodal foundational models for image-space obstacle segmentation with unsupervised, computational-geometry-based 3D localization. Key contributions are the integration of Grounding DINO and Segment Anything for general obstacle cueing, a four-step unsupervised 3D detector that includes ground-plane removal, clustering, and tracking, and a new annotated obstacle dataset. Results show obstacle detection at up to 100 meters, with localization errors of around 0.65 m longitudinally and 0.57 m laterally. The paper also highlights limitations with non-reflective distant objects and inaccurate road masks, and points to future improvements such as leveraging ground-plane depth and correcting road masks.

Abstract

Current autonomous driving perception models primarily rely on supervised learning with predefined categories. However, these models struggle to detect general obstacles not included in the fixed category set due to their variability and numerous edge cases. To address this issue, we propose a combination of multimodal foundational model-based obstacle segmentation with traditional unsupervised computational geometry-based outlier detection. Our approach operates offline, allowing us to leverage non-causality, and utilizes training-free methods. This enables the detection of general obstacles in 3D without the need for expensive retraining. To overcome the limitations of publicly available obstacle detection datasets, we collected and annotated our dataset, which includes various obstacles even in distant regions.
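
The idea of treating obstacles as geometric outliers relative to the road surface can be illustrated with a ground-plane removal step. The following is a minimal sketch, not the authors' implementation: it assumes a RANSAC plane fit (this summary does not say which plane-fitting method the paper uses), and the function names, the 0.1 m threshold, and the iteration count are all hypothetical choices for illustration.

```python
import random


def fit_plane(p1, p2, p3):
    """Return (a, b, c, d) with a unit normal for the plane through three
    points, or None if the points are (nearly) collinear."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    # Normal vector is the cross product of the two edge vectors.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (nx * nx + ny * ny + nz * nz) ** 0.5
    if norm < 1e-9:
        return None
    nx, ny, nz = nx / norm, ny / norm, nz / norm
    d = -(nx * p1[0] + ny * p1[1] + nz * p1[2])
    return nx, ny, nz, d


def point_plane_distance(plane, point):
    """Unsigned distance from a point to a plane with a unit normal."""
    a, b, c, d = plane
    return abs(a * point[0] + b * point[1] + c * point[2] + d)


def remove_ground(points, threshold=0.1, iterations=200, seed=0):
    """Fit the dominant plane with RANSAC (assumed to be the ground) and
    return the points farther than `threshold` metres from it."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, -1
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        inliers = sum(
            1 for p in points if point_plane_distance(plane, p) <= threshold
        )
        if inliers > best_inliers:
            best_plane, best_inliers = plane, inliers
    if best_plane is None:
        raise ValueError("could not fit a plane to the given points")
    return [p for p in points if point_plane_distance(best_plane, p) > threshold]
```

In a pipeline of the kind described above, the points that survive this step would be the candidates passed on to clustering and tracking.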

Paper Structure (20 sections, 6 figures)

Figures (6)

  • Figure 1: The architecture of the proposed obstacle detection method. The upper part of the figure depicts the foundational model-based general obstacle segmentation, while the lower part shows the unsupervised offline detector that is responsible for localization in 3D space.
  • Figure 2: One result of the foundational model-based obstacle segmentation method. Left: bounding box determined by Grounding DINO prompted with the word 'road'. Middle: road mask determined by SAM prompted by the bounding box given by Grounding DINO. Right: obstacle candidate mask with an initial depth estimation from projected LiDAR points.
  • Figure 3: Quantitative results of the proposed method with bird's-eye view heatmaps.
  • Figure 4: Qualitative results of the naive method.
  • Figure 5: Additional evaluation results of the proposed method.
  • ...and 1 more figures
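
The Figure 2 caption mentions an initial depth estimate for the obstacle candidate mask obtained from projected LiDAR points. One way this could work is sketched below. It is an illustration under stated assumptions, not the paper's method: it assumes the LiDAR points are already expressed in the camera frame (z pointing forward), a simple pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and the median as the depth statistic.

```python
from statistics import median


def project_point(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point to integer pixel coordinates (u, v)
    with a pinhole model. Returns None for points behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))


def mask_depth(points_cam, mask, fx, fy, cx, cy):
    """Median depth (z) of the points that project inside a boolean mask.

    `mask` is a list of rows, indexed as mask[v][u]. Returns None when no
    point lands inside the mask.
    """
    height, width = len(mask), len(mask[0])
    depths = []
    for point in points_cam:
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is None:
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height and mask[v][u]:
            depths.append(point[2])
    return median(depths) if depths else None
```

The median is used here because it tolerates a few background points that leak into the mask; a real system would also need the LiDAR-to-camera extrinsics and lens distortion handling, both omitted from this sketch.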