Learning 3D Perception from Others' Predictions

Jinsu Yoo, Zhenyang Feng, Tai-Yu Pan, Yihong Sun, Cheng Perng Phoo, Xiangyu Chen, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao

TL;DR

This work introduces a label-efficient paradigm for building 3D detectors: an ego agent learns from the predictions of nearby reference units. It identifies viewpoint mismatch and mislocalization as the core challenges and proposes R&B-POP, which combines a learnable box ranker for pseudo-label refinement with a distance-based curriculum for self-training, progressively improving pseudo-label quality. Experiments on V2V4Real (and OPV2V) show large gains over naively using the received predictions as labels. The approach remains strong when sensors, detectors, or domains differ, and it approaches the upper bound obtained by training on the ego car's ground-truth labels. The method offers a practical path for multi-agent 3D perception where sharing raw data is impractical, supporting cross-vehicle collaboration in real-world settings.

Abstract

Accurate 3D object detection in real-world environments requires large amounts of high-quality annotated data. Acquiring such data is tedious and expensive, and the effort often has to be repeated when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario for constructing 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share their predictions with the ego agent (e.g., car). Naively using the received predictions as ground truth to train the ego car's detector, however, leads to inferior performance. We systematically study the problem and identify viewpoint mismatches and mislocalization (due to synchronization and GPS errors) as the main causes, which unavoidably result in false positives, false negatives, and inaccurate pseudo labels. We propose a distance-based curriculum, first learning from closer units with similar viewpoints and subsequently improving the quality of other units' predictions via self-training. We further demonstrate that an effective pseudo-label refinement module can be trained with a handful of annotated data, greatly reducing the amount of data needed to train an object detector. We validate our approach on the recently released real-world collaborative driving dataset, using reference cars' predictions as pseudo labels for the ego car. Extensive experiments covering several scenarios (e.g., different sensors, detectors, and domains) demonstrate the effectiveness of our approach toward label-efficient learning of 3D perception from other units' predictions.
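
The distance-based curriculum described in the abstract can be pictured as a staged training loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `Frame` structure, the distance thresholds, and the `train_fn` / `relabel_fn` callables are all hypothetical placeholders for a real detector training routine and a real self-training relabeling step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class Frame:
    """One ego frame plus the boxes shared by a reference car (hypothetical structure)."""
    frame_id: int
    ego_ref_distance_m: float            # distance between ego and reference car
    pseudo_boxes: list = field(default_factory=list)


def frames_within(frames: Sequence[Frame], max_dist_m: float) -> List[Frame]:
    """Keep frames whose reference car is close enough to share a similar viewpoint."""
    return [f for f in frames if f.ego_ref_distance_m <= max_dist_m]


def run_curriculum(
    frames: Sequence[Frame],
    thresholds_m: Sequence[float],
    train_fn: Callable[[List[Frame]], object],
    relabel_fn: Callable[[object, Frame], Frame],
) -> Optional[object]:
    """Train in stages, admitting farther reference cars at each stage.

    Assumption: after the first stage, the current detector refreshes the
    pseudo labels (self-training) before the next round of training.
    """
    detector = None
    for max_dist in sorted(thresholds_m):
        subset = frames_within(frames, max_dist)
        if detector is not None:
            subset = [relabel_fn(detector, f) for f in subset]
        detector = train_fn(subset)
    return detector


# Toy usage with stand-in callables that only record what they were given.
frames = [Frame(0, 5.0), Frame(1, 25.0), Frame(2, 60.0)]
seen_stages: List[List[int]] = []


def toy_train(subset: List[Frame]) -> int:
    seen_stages.append([f.frame_id for f in subset])
    return len(subset)                   # stand-in for a trained detector


def toy_relabel(detector: object, frame: Frame) -> Frame:
    return frame                         # stand-in: labels left unchanged


final_detector = run_curriculum(frames, [10.0, 40.0, 100.0], toy_train, toy_relabel)
```

In this toy run, each stage sees a superset of the previous stage's frames, which mirrors the idea of starting from close reference cars and gradually widening the range.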

Paper Structure (30 sections, 17 figures, 11 tables)

Figures (17)

  • Figure 1: Research problem of learning from others' predictions. We study the scenario where an agent (e.g., ego car) leverages the predictions made by another agent (e.g., a high-end reference car) as supervision to train its own 3D object detector. We observe two challenges: (1) viewpoint mismatch between the two cars and (2) mislocalization due to synchronization/GPS errors.
  • Figure 2: Pseudo-label quality, measured as recall and precision at IoU 0.5 against the ground truth (GT) of the ego car (E). Our methods improve the label quality significantly.
  • Figure 3: Point and box discrepancies between the ego and reference cars on the real-world V2V4Real dataset (Xu et al., 2023).
  • Figure 4: Mislocalization between the GT of the ego car (E) and the GT of the reference car (R).
  • Figure 5: Box ranker for refining localization error. With a few annotated frames (or boxes), we train a ranker that can estimate the quality of a given box. During inference for pseudo labels, we sample multiple candidates near the initial noisy box and choose the one with the best IoU as estimated by the ranker.
  • ...and 12 more figures
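
The refinement step in the Figure 5 caption (sample candidates near a noisy box, then keep the one the ranker rates best) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's ranker: the bird's-eye-view box format, the uniform center jitter, the sampling parameters, and the toy `score_fn` are all hypothetical. In practice `score_fn` would be a learned model that predicts box quality from the ego car's point cloud.

```python
import math
import random
from typing import Callable, List, Tuple

# Hypothetical bird's-eye-view box format: (x, y, length, width, yaw).
Box = Tuple[float, float, float, float, float]


def sample_candidates(box: Box, n: int, max_shift_m: float, rng: random.Random) -> List[Box]:
    """Jitter the box center to build candidates; the original box stays as candidate 0."""
    x, y, length, width, yaw = box
    candidates = [box]
    for _ in range(n):
        dx = rng.uniform(-max_shift_m, max_shift_m)
        dy = rng.uniform(-max_shift_m, max_shift_m)
        candidates.append((x + dx, y + dy, length, width, yaw))
    return candidates


def refine_box(
    box: Box,
    score_fn: Callable[[Box], float],
    n: int = 64,
    max_shift_m: float = 2.0,
    seed: int = 0,
) -> Box:
    """Return the candidate that the scoring function rates highest."""
    rng = random.Random(seed)
    return max(sample_candidates(box, n, max_shift_m, rng), key=score_fn)


# Toy stand-in for a learned ranker: score is higher when the box center
# is closer to a known object center (unavailable in a real setting).
true_center = (10.0, 4.0)


def center_error(b: Box) -> float:
    return math.hypot(b[0] - true_center[0], b[1] - true_center[1])


def toy_score(b: Box) -> float:
    return -center_error(b)


noisy_box: Box = (11.2, 3.1, 4.5, 1.9, 0.0)   # mislocalized pseudo label
refined_box = refine_box(noisy_box, toy_score)
```

Because the original box is always among the candidates, the selected box can never score worse than the input under the given scoring function. How well this transfers to real data depends entirely on how accurately the ranker's score tracks the true IoU.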