Towards Open-set Camera 3D Object Detection

Zhuolin He, Xinrun Li, Heng Gao, Jiachen Tang, Shoumeng Qiu, Wenfu Wang, Lvjian Lu, Xuchong Qiu, Xiangyang Xue, Jian Pu

TL;DR

OS-Det3D (Open-set Camera 3D Object Detection) is a two-stage training framework that enhances the ability of camera 3D detectors to identify both known and unknown objects; it also introduces a Joint Objectness Selection (JOS) module.

Abstract

Traditional camera 3D object detectors are typically trained to recognize a predefined set of known object classes. In real-world scenarios, these detectors may encounter unknown objects outside the training categories and fail to identify them correctly. To address this gap, we present OS-Det3D (Open-set Camera 3D Object Detection), a two-stage training framework enhancing the ability of camera 3D detectors to identify both known and unknown objects. The framework involves our proposed 3D Object Discovery Network (ODN3D), which is specifically trained using geometric cues such as the location and scale of 3D boxes to discover general 3D objects. ODN3D is trained in a class-agnostic manner, and the provided 3D object region proposals inherently come with data noise. To boost accuracy in identifying unknown objects, we introduce a Joint Objectness Selection (JOS) module. JOS selects the pseudo ground truth for unknown objects from the 3D object region proposals of ODN3D by combining the ODN3D objectness and camera feature attention objectness. Experiments on the nuScenes and KITTI datasets demonstrate the effectiveness of our framework in enabling camera 3D detectors to successfully identify unknown objects while also improving their performance on known objects.
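The core of JOS, as described above, is fusing two objectness signals per proposal and keeping the best candidates as pseudo ground truth. The sketch below illustrates this selection step in numpy; the product fusion rule and the function name are assumptions for illustration, since the abstract only states that the ODN3D objectness and the camera feature attention objectness are combined.

```python
import numpy as np

def joint_objectness_selection(s_obj, s_att, k_u):
    """Illustrative sketch of JOS pseudo-GT selection.

    s_obj : (N,) ODN3D geometric objectness of the 3D region candidates.
    s_att : (N,) attention objectness pooled from camera BEV features.
    k_u   : number of candidates kept as pseudo-GT for unknown objects.

    The product fusion below is an assumption; the paper only says
    the two scores are combined.
    """
    joint = s_obj * s_att        # assumed fusion of the two objectness cues
    order = np.argsort(-joint)   # sort candidates by descending joint score
    return order[:k_u]           # indices of the selected pseudo-GT proposals

# Toy usage: 5 candidates, keep the 2 strongest as pseudo-GT.
s_obj = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
s_att = np.array([0.5, 0.9, 0.6, 0.1, 0.7])
print(joint_objectness_selection(s_obj, s_att, 2))
```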

Paper Structure

This paper contains 19 sections, 5 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Proposed OS-Det3D Training Framework. ODN3D denotes the 3D Object Discovery Network. Both top-$k_{o}$ and top-$k_{u}$ are hyperparameters. In Stage 1, voxel features from a LiDAR frame are extracted by the LiDAR backbone and fed, together with a set of initial queries, into ODN3D's encoder-decoder. At the decoder output, each 3D object query is processed by three different branches. Our objectness branch outputs the geometric confidence $s^{\prime}_{obj}$ that a query is an object. The output of ODN3D is a set of 3D object region proposals. In ground truth (GT) filtering, we discard 3D object region proposals that overlap with instances of known categories. After GT filtering, the top-$k_{o}$ proposals are selected as 3D object region candidates. In Stage 2, we extract bird's-eye view (BEV) features from the final layer of the camera encoder and perform a channel pooling step. JOS computes the BEV attention value within the BEV region of each 3D object region candidate to select the top-$k_{u}$ candidates as the pseudo ground truth (pseudo-GT) of unknown objects. Finally, these pseudo-GT labels of unknown objects are fed to the classification branch of the camera detector.
  • Figure 2: Overview of ODN3D. We retain the original Hungarian matching, classification branch, and regression branch of the transformer-based architecture in ODN3D. At ODN3D's decoder output, we design a GeoHungarian matching algorithm to sample 3D object queries. Whereas Hungarian matching computes a cost matrix with category-dependent cost values, GeoHungarian matching is category-independent and relies solely on geometric cues, so the positive queries sampled by the two matching strategies may differ. On top of the GeoHungarian matching algorithm, we further introduce an objectness branch that produces the objectness score $s_{obj}$, which assesses the geometric localization quality of these positive queries.
  • Figure 3: Visualization Results on nuScenes Split 2. The qualitative results showcase different outcomes. Row 1 shows the ground truth (GT) with all classes: car, pedestrian, bicycle, barrier, construction vehicle, truck, bus, trailer, motorcycle, traffic cone, debris (trash bins, etc.). Row 2 shows BEVFormer (closed-set), which detects only the known classes (car, pedestrian, bicycle, barrier, construction vehicle), and row 3 shows BEVFormer (OS-Det3D), which also identifies unknown instances (truck, bus, trailer, motorcycle, traffic cone, debris). (Zoom in for a better view.)
  • Figure 4: Overview of JOS. JOS computes the corresponding attention score $s_{att}$ for each of the $\text{top-}k_{o}$ candidates as the mean attention score within its BEV 2D region of interest. Then, we select the $\text{top-}k_{u}$ candidates sorted by $s'_{obj}$ as the pseudo ground truth (pseudo-GT) of unknown objects.
  • Figure 5: Different adaption of OW-DETR approach.
  • ...and 1 more figure
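The attention-objectness step sketched in Figure 4 (channel pooling of the BEV feature map, then mean attention inside each candidate's BEV region of interest) can be illustrated as follows. The mean channel pooling and the axis-aligned integer ROI format `(x0, y0, x1, y1)` are assumptions for this sketch; the paper's exact pooling operator and ROI parameterization may differ.

```python
import numpy as np

def bev_attention_objectness(bev_feat, rois):
    """Illustrative sketch of the attention score $s_{att}$ from Fig. 4.

    bev_feat : (C, H, W) camera BEV features from the final encoder layer.
    rois     : list of (x0, y0, x1, y1) axis-aligned BEV pixel boxes
               (hypothetical format for this sketch).
    Returns one mean-attention score per candidate ROI.
    """
    att = bev_feat.mean(axis=0)        # channel pooling -> (H, W) attention map
    scores = []
    for x0, y0, x1, y1 in rois:
        region = att[y0:y1, x0:x1]     # attention values inside the BEV ROI
        scores.append(float(region.mean()))
    return scores

# Toy usage: a 2-channel 4x4 BEV map and one candidate ROI.
bev_feat = np.arange(32, dtype=float).reshape(2, 4, 4)
print(bev_attention_objectness(bev_feat, [(0, 0, 2, 2)]))
```

In the full pipeline these $s_{att}$ values would be combined with the ODN3D objectness $s^{\prime}_{obj}$ of the same candidates to rank and select the top-$k_{u}$ pseudo-GT boxes.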