FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection

Anqi Joyce Yang, James Tu, Nikita Dvornik, Enxu Li, Raquel Urtasun

TL;DR

This paper proposes FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. It demonstrates that rich priors from vision foundation models, combined with careful multi-modal fusion designs, lead to large gains for long-tailed 3D detection.

Abstract

In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction worker) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and refines them with attention especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multi-modal fusion designs leads to large gains for long-tailed 3D detection. Project website is at https://waabi.ai/fomo3d/.

Paper Structure (40 sections, 11 equations, 15 figures, 14 tables)

Figures (15)

  • Figure 1: Vision foundation models OWL (left) and Metric3D (middle) show remarkable zero-shot generalization capabilities for 2D object detection and monocular depth estimation. Our model FOMO-3D (right) incorporates these strong priors along with LiDAR for multi-modal 3D object detection.
  • Figure 2: Overview of FOMO-3D, which leverages vision foundation models OWL and Metric3D, and follows a two-stage paradigm with a multi-modal proposal stage and an attention-based refinement stage.
  • Figure 3: [Left] Lifting OWL camera proposals to 3D bounding boxes. We first unproject pixels inside the camera proposal into 3D using Metric3D depths, and then encode the points into a BEV feature map. Each OWL token subsequently attends to fused LiDAR and image BEV features sampled along the frustum. [Right] During supervision, camera proposals are only matched to ground truth boxes inside the object frustum.
  • Figure 4: Real-world class distribution on nuScenes and Highway. Both exhibit severe class imbalances.
  • Figure 5: [Highway] Per-class mAP gains over the base LiDAR-only detector, for distance buckets [0, 50], [50, 200] and [200, 230] meters relative to the SDV. FOMO-3D (no cam prop) corresponds to $M_2$ in Table \ref{tab:nusc-rarity-ablation}.
  • ...and 10 more figures
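
The first lifting step in the Figure 3 caption, unprojecting the pixels inside a camera proposal into 3D using metric depth, can be illustrated with a standard pinhole camera model. The following is a minimal sketch, not the authors' implementation. The function name `unproject_box_pixels`, the integer box format, the intrinsics matrix `K`, and the camera-frame convention (x right, y down, z forward) are all assumptions made for illustration. The subsequent BEV encoding and attention steps are not covered.

```python
import numpy as np


def unproject_box_pixels(depth, K, box):
    """Lift the pixels inside a 2D box to 3D camera-frame points.

    Hypothetical helper, assuming a pinhole camera:
      depth: (H, W) array of per-pixel metric depth in meters
      K:     (3, 3) camera intrinsics matrix
      box:   (u_min, v_min, u_max, v_max) integer pixel bounds, max exclusive
    Returns an (N, 3) array of points (x right, y down, z forward).
    """
    u_min, v_min, u_max, v_max = box
    # Pixel coordinate grids covering the box (u = column, v = row).
    us, vs = np.meshgrid(np.arange(u_min, u_max), np.arange(v_min, v_max))
    z = depth[vs, us]

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

In this sketch, each pixel inside the box becomes one 3D point in the camera frame. Transforming the points into the vehicle frame would additionally require the camera extrinsics, which are omitted here.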