Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection

Christian Fruhwirth-Reisinger; Wei Lin; Dušan Malić; Horst Bischof; Horst Possegger

TL;DR

ViLGOD is a fully unsupervised yet class-aware LiDAR 3D object detector that combines spatio-temporal clustering with vision-language classification. It projects 3D clusters into multi-view 2D depth maps, applies CLIP to obtain class scores, and propagates labels temporally across tracks, delivering zero-shot, class-aware detections from LiDAR data alone. It outperforms prior unsupervised detectors on the Waymo Open Dataset and Argoverse 2 and provides pseudo-labels that can be used to train standard 3D detectors. Because no human annotations are required, the approach reduces labeling cost while still providing class-aware 3D perception for autonomous systems.

Abstract

Accurate 3D object detection in LiDAR point clouds is crucial for autonomous driving systems. To achieve state-of-the-art performance, the supervised training of detectors requires large amounts of human-annotated data, which is expensive to obtain and restricted to predefined object categories. To mitigate manual labeling efforts, recent unsupervised object detection approaches generate class-agnostic pseudo-labels for moving objects, which subsequently serve as the supervision signal to bootstrap a detector. Despite promising results, these approaches neither provide class labels nor generalize well to static objects. Furthermore, they are mostly restricted to data containing multiple drives from the same scene or images from a precisely calibrated and synchronized camera setup. To overcome these limitations, we propose a vision-language-guided unsupervised 3D detection approach that operates exclusively on LiDAR point clouds. We transfer CLIP knowledge to classify point clusters of static and moving objects, which we discover by exploiting the inherent spatio-temporal information of LiDAR point clouds for clustering, tracking, as well as box and label refinement. Our approach outperforms state-of-the-art unsupervised 3D object detectors on the Waymo Open Dataset ($+23~\text{AP}_{3D}$) and Argoverse 2 ($+7.9~\text{AP}_{3D}$) and provides class labels not solely based on object size assumptions, marking a significant advancement in the field.
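
The label refinement along tracks mentioned above can be illustrated with a small sketch. It assumes that each tracked cluster already has one vector of class scores per frame (for example from CLIP) and simply averages them before taking the best class. The class names, the function `propagate_track_label`, and the averaging rule are illustrative assumptions, not the voting scheme used in the paper.

```python
import numpy as np

# Illustrative class names (assumption, not taken from the paper).
CLASSES = ("vehicle", "pedestrian", "cyclist")


def propagate_track_label(frame_scores):
    """Assign one class label to a whole track.

    frame_scores: array-like of shape (T, C) holding per-frame class
    scores for a single tracked object (T frames, C classes).
    Returns the winning class name and the averaged score vector.
    """
    scores = np.asarray(frame_scores, dtype=np.float64)
    if scores.ndim != 2 or scores.shape[1] != len(CLASSES):
        raise ValueError("expected an array of shape (T, len(CLASSES))")
    # Average over time so that a few noisy frames cannot flip the label.
    mean_scores = scores.mean(axis=0)
    return CLASSES[int(mean_scores.argmax())], mean_scores


# One track observed in three frames; the middle frame is misclassified.
track = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
]
label, mean_scores = propagate_track_label(track)
print(label)
```

In this toy track, the single frame that favors the second class is outvoted by the other two, so every frame of the track receives the same label.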

Paper Structure (35 sections, 1 equation, 2 figures, 8 tables)

Introduction
Related Work
Fully supervised LiDAR-based 3D object detection.
Label-efficient 3D object detection.
Unsupervised 3D object detection.
CLIP for 3D understanding.
Vision-Language Guided 3D Object Detection
Unsupervised Object Discovery
Proposal generation.
Temporal coherence.
Vision-Language Guided Object Classification
CLIP preliminary.
Transfer CLIP knowledge for 3D recognition.
Category text refinement.
Multi-view label voting.
...and 20 more sections

Figures (2)

Figure 1: Comparison of point projections. We illustrate two projection examples in three different views: LiDAR clusters from the WOD [sun_2020_waymo] (top row) and sampled CAD models [wu20153d] (bottom row), the latter as evaluated in [zhu_2022_pointclipv2]. While points sampled from CAD models produce consistently good results, LiDAR point cluster projections are negatively affected by incomplete clusters due to self-occlusion (car, top left) and sparsity (pedestrian, top right).
Figure 2: ViLGOD overview. After spatio-temporal clustering and filtering, we project 3D point clusters into 2D depth maps, which are subsequently fed to CLIP for zero-shot recognition. Objects close to the ego-vehicle result in smooth depth maps, which can be correctly classified with high certainty, e.g., the car in the bottom right. Distant objects, on the other hand, are more challenging and require additional context information to improve classification results, e.g., the pedestrian on the top right. We omit bounding boxes to enhance clarity.
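
The projection step described in the captions can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a plain orthographic projection of a centred cluster along one horizontal viewing direction, whereas the paper's actual projection, image size, and any densification or smoothing are not specified here. The function `cluster_to_depth_map` and its parameters `yaw` and `size` are hypothetical names.

```python
import numpy as np


def cluster_to_depth_map(points, yaw=0.0, size=32):
    """Orthographically project an (N, 3) point cluster to a depth map.

    The cluster is centred, rotated by `yaw` around the z-axis, and
    viewed along the +x axis. Pixel values lie in [0, 1]; nearer points
    are brighter and empty pixels stay 0.
    """
    pts = np.asarray(points, dtype=np.float64)
    pts = pts - pts.mean(axis=0)  # centre the cluster at the origin
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pts = pts @ rot.T  # rotate to emulate a different viewpoint
    scale = np.abs(pts).max()
    if scale > 0.0:
        pts = pts / scale  # normalise coordinates to [-1, 1]
    cols = np.round((pts[:, 1] + 1.0) / 2.0 * (size - 1)).astype(int)
    rows = np.round((1.0 - pts[:, 2]) / 2.0 * (size - 1)).astype(int)
    nearness = 1.0 - (pts[:, 0] + 1.0) / 2.0  # 1 = nearest, 0 = farthest
    depth = np.zeros((size, size), dtype=np.float64)
    # Keep the nearest point wherever several points share a pixel.
    np.maximum.at(depth, (rows, cols), nearness)
    return depth


# Two points along the viewing axis fall into one pixel; after a
# 90 degree rotation they are seen side by side.
pair = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
front = cluster_to_depth_map(pair, yaw=0.0)
side = cluster_to_depth_map(pair, yaw=np.pi / 2)
views = [cluster_to_depth_map(pair, yaw=a) for a in (0.0, np.pi / 2, np.pi)]
```

Rendering the same cluster at several `yaw` values yields the multi-view inputs that a 2D image classifier such as CLIP would then score. Sparse or self-occluded clusters leave most pixels empty in such a map, which is the effect Figure 1 illustrates.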
