UNION: Unsupervised 3D Object Detection using Object Appearance-based Pseudo-Classes

Ted Lentsch, Holger Caesar, Dariu M. Gavrila

TL;DR

UNION tackles the problem of unsupervised 3D object detection by fusing LiDAR, camera, and temporal information to discover both static and dynamic mobile objects without manual labels. It introduces an appearance-based clustering approach on camera-derived features to separate mobile foreground from background clutter, generating pseudo-bounding boxes and pseudo-classes that train existing detectors in a single pass, thus avoiding costly self-training loops. The approach extends 3D object discovery to 3D object detection by using appearance-based pseudo-classes for multi-class training, achieving state-of-the-art results on nuScenes for unsupervised discovery and demonstrating effective class-agnostic and multi-class performance. This work paves the way for scalable, label-free 3D detection in autonomous systems by leveraging multi-modal signals and self-supervised appearance cues.

Abstract

Unsupervised 3D object detection methods have emerged to leverage vast amounts of data without requiring manual labels for training. Recent approaches rely on dynamic objects for learning to detect mobile objects but penalize the detections of static instances during training. Multiple rounds of self-training are used to add detected static instances to the set of training targets; this procedure to improve performance is computationally expensive. To address this, we propose the method UNION. We use spatial clustering and self-supervised scene flow to obtain a set of static and dynamic object proposals from LiDAR. Subsequently, object proposals' visual appearances are encoded to distinguish static objects in the foreground and background by selecting static instances that are visually similar to dynamic objects. As a result, static and dynamic mobile objects are obtained together, and existing detectors can be trained in a single training run. In addition, we extend 3D object discovery to detection by using object appearance-based cluster labels as pseudo-class labels for training object classification. We conduct extensive experiments on the nuScenes dataset and increase the state-of-the-art performance for unsupervised 3D object discovery, i.e. UNION more than doubles the average precision to 39.5. The code is available at github.com/TedLentsch/UNION.
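As a rough illustration of how spatial clusters and scene flow could be combined into static and dynamic object proposals, the sketch below labels each spatial cluster as dynamic when the median speed of its points exceeds a threshold. This is not the authors' implementation: the function name `label_dynamic_proposals`, the median-speed criterion, and the 0.5 m/s threshold are all assumptions made for the example, and the per-point cluster ids and flow vectors are taken as given inputs.

```python
from math import sqrt
from statistics import median


def label_dynamic_proposals(point_cluster_ids, point_flows, speed_threshold=0.5):
    """Label each spatial cluster as dynamic (True) or static (False).

    point_cluster_ids: one int per LiDAR point; -1 marks unclustered noise.
    point_flows: one (vx, vy, vz) velocity estimate per point, in m/s.
    speed_threshold: hypothetical cut-off in m/s (an assumption, not from the paper).
    """
    speeds_per_cluster = {}
    for cluster_id, (vx, vy, vz) in zip(point_cluster_ids, point_flows):
        if cluster_id < 0:
            continue  # skip points that belong to no proposal
        speed = sqrt(vx * vx + vy * vy + vz * vz)
        speeds_per_cluster.setdefault(cluster_id, []).append(speed)

    # The median is used so that a few noisy flow vectors do not flip the label.
    return {
        cluster_id: median(speeds) > speed_threshold
        for cluster_id, speeds in speeds_per_cluster.items()
    }


# Toy example: cluster 0 is (almost) stationary, cluster 1 moves at 5 m/s.
cluster_ids = [0, 0, 0, 1, 1, -1]
flows = [(0, 0, 0), (0.1, 0, 0), (0, 0, 0), (3, 4, 0), (3, 4, 0), (9, 9, 9)]
proposal_is_dynamic = label_dynamic_proposals(cluster_ids, flows)
```

In this toy input, the noise point is ignored and only the second cluster is marked dynamic.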

Paper Structure (13 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: UNION discovers mobile objects (e.g. cars, pedestrians, cyclists) in an unsupervised manner by exploiting LiDAR, camera, and temporal information jointly. The key observation is that mobile objects can be distinguished from background objects (e.g. buildings, trees, poles) by grouping object proposals with similar visual appearance, i.e. clustering their appearance embeddings, and selecting appearance clusters that contain at least $X$ dynamic instances.
  • Figure 2: Comparison of various designs for unsupervised 3D object discovery. (a) Most object discovery methods exploit LiDAR to generate pseudo-bounding boxes and use these to train a detector in a class-agnostic setting followed by self-training. (b) Wang et al. [wang20224d] generate pseudo-bounding boxes similar to (a) but alternate between training a LiDAR-based detector and a camera-based detector for self-training. (c) We use multi-modal data for generating pseudo-bounding boxes and pseudo-class labels, and train a multi-class detector without requiring self-training.
  • Figure 3: Qualitative results for (components of) UNION compared to ground truth annotations. (a) HDBSCAN (step 1 in Figure 1): object proposals (spatial clusters) in black. (b) Scene flow (step 2 in Figure 1): static and dynamic object proposals in black and red, respectively. (c) UNION: static and dynamic mobile objects in green and red, respectively. (d) Ground truth: mobile objects in blue.
  • Figure 4: Dynamic object proposal fractions of the visual appearance clusters. We use a threshold of 5 for selecting clusters.
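The cluster selection described in the captions of Figures 1 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the appearance cluster ids are taken as given, and the threshold of 5 is interpreted here as a percentage of dynamic proposals per cluster, which the captions do not state explicitly.

```python
from collections import defaultdict


def select_mobile_clusters(appearance_cluster_ids, is_dynamic, min_dynamic_percent=5.0):
    """Return the appearance clusters whose share of dynamic proposals is at least
    min_dynamic_percent (assumed here to be a percentage)."""
    counts = defaultdict(lambda: [0, 0])  # cluster id -> [n_dynamic, n_total]
    for cluster_id, dynamic in zip(appearance_cluster_ids, is_dynamic):
        counts[cluster_id][0] += int(dynamic)
        counts[cluster_id][1] += 1
    return {
        cluster_id
        for cluster_id, (n_dynamic, n_total) in counts.items()
        if 100.0 * n_dynamic / n_total >= min_dynamic_percent
    }


def assign_pseudo_classes(appearance_cluster_ids, selected_clusters):
    """Map every proposal to a pseudo-class index, or None if its cluster was
    not selected (treated as background in this sketch)."""
    class_of = {cid: idx for idx, cid in enumerate(sorted(selected_clusters))}
    return [class_of.get(cid) for cid in appearance_cluster_ids]


# Toy example: cluster 0 is 10% dynamic, cluster 1 is 0% dynamic, cluster 2 is 25% dynamic.
cluster_ids = [0] * 20 + [1] * 20 + [2] * 4
dynamic_flags = [True, True] + [False] * 18 + [False] * 20 + [True] + [False] * 3
selected = select_mobile_clusters(cluster_ids, dynamic_flags)
pseudo_classes = assign_pseudo_classes(cluster_ids, selected)
```

Static proposals that fall into a selected cluster are kept alongside the dynamic ones, which mirrors the idea of recovering static mobile objects through their visual similarity to dynamic objects. Reusing the cluster index as a pseudo-class label is one simple way to obtain multi-class training targets.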