OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

Jinghua Hou, Tong Wang, Xiaoqing Ye, Zhe Liu, Shi Gong, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai

TL;DR

OPEN addresses a limitation of pixel-wise depth supervision in multi-view 3D detection by introducing object-wise depth and a novel object-wise position embedding. The method combines a Pixel-wise Depth Encoder (PDE), an Object-wise Depth Encoder (ODE) with temporal fusion, and an Object-wise Position Embedding (OPE) to inject 3D center depth information into a transformer-based detector, complemented by a Depth-aware Focal Loss (DFL). Results on nuScenes show state-of-the-art performance, with notable gains on distant objects, and ablations show that each component contributes, OPE in particular. The approach yields more accurate 3D object-aware features and improves detection performance while maintaining competitive efficiency, highlighting the value of object-centric depth information in multi-view 3D perception.

Abstract

Accurate depth information is crucial for enhancing the performance of multi-view 3D object detection. Despite the success of some existing multi-view 3D detectors that use pixel-wise depth supervision, they overlook two significant phenomena: 1) the depth supervision obtained from LiDAR points is usually distributed on the surface of the object, which suits existing DETR-based 3D detectors poorly because it lacks the depth of the 3D object center; 2) for distant objects, fine-grained depth estimation of the whole object is more challenging. Therefore, we argue that the object-wise depth (or 3D center of the object) is essential for accurate detection. In this paper, we propose a new multi-view 3D object detector named OPEN, whose main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding. Specifically, we first employ an object-wise depth encoder, which takes the pixel-wise depth map as a prior, to accurately estimate the object-wise depth. Then, we use the proposed object-wise position embedding to encode the object-wise depth information into the transformer decoder, thereby producing 3D object-aware features for final detection. Extensive experiments verify the effectiveness of our proposed method. Furthermore, OPEN achieves new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark.
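The gap between surface depth and center depth in the first point can be illustrated with a small worked example. This is only a sketch under simplifying assumptions that are not from the paper: the object is a box viewed head-on along the camera's optical axis, and the numbers (a center at 30 m, an extent of 4.6 m along the ray) and the function name `surface_vs_center_depth` are hypothetical.

```python
def surface_vs_center_depth(center_depth_m, extent_along_ray_m):
    """Return (nearest-surface depth, center depth) for a box viewed head-on.

    Assumption: the viewing ray passes through the box center, so the
    nearest visible surface lies half the extent in front of the center.
    """
    nearest_surface = center_depth_m - extent_along_ray_m / 2.0
    return nearest_surface, center_depth_m


# Hypothetical numbers: box center at 30 m, 4.6 m long along the ray.
surface, center = surface_vs_center_depth(30.0, 4.6)
gap = center - surface
print(f"surface depth: {surface:.2f} m, center depth: {center:.2f} m, gap: {gap:.2f} m")
```

Under these assumptions, a LiDAR return on the visible face sits about 2.3 m in front of the center that a DETR-style detector has to regress, which is the mismatch the abstract describes.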

Paper Structure

This paper contains 22 sections, 8 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Illustration of object-wise depth prediction. The blue points represent the pixel-wise depth, which is usually distributed on the surface of objects and supervised by projected LiDAR points. The red points represent the object-wise depth, which is the depth of the 3D object center and is supervised by the centers of projected, human-annotated 3D ground-truth bounding boxes.
  • Figure 2: The overall architecture of the proposed OPEN, which consists of the pixel-wise depth encoder (PDE), the object-wise depth encoder (ODE), and the object-wise position embedding (OPE). Specifically, the PDE first uses a DepthNet to predict the pixel-wise depth map, supervised by projected LiDAR points. Then, based on the predicted pixel-wise depth map, the ODE predicts the object-wise depth, supervised by the centers of projected 3D bounding boxes. Finally, OPEN uses the object-wise position embedding, built from the predicted object-wise depth and the corresponding 2D object centers, to convert the multi-view image features into object-wise 3D features that interact with object queries to generate the final detection results.
  • Figure 3: The overall architecture of the ODE. The ODE first converts image pixels from pixel coordinates to camera coordinates, then aggregates current and historical features with a streaming temporal fusion strategy to generate the depth embedding for object-wise depth prediction. Finally, the ODE uses an FFN to predict the object-wise depth $d$ and the corresponding object center $c$ from the depth embedding.
  • Figure 4: Comparison of the ray-aware position embedding (a), the point-aware position embedding (b), and the object-wise position embedding (c). Unlike the other two, OPE uses the 3D object center to generate the position embedding, which yields a better 3D representation.
  • Figure 5: Comparison of attention weight maps for the ray-aware position embedding of StreamPETR (a), the point-aware position embedding of 3DPPE (b), and the object-wise position embedding (c) on the nuScenes val set. With OPE, OPEN generates better attention weight maps for some hard-to-detect objects, which are highlighted by red circles.
  • ...and 3 more figures
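
The geometry described in the captions of Figures 2 and 3 (a 2D object center plus an object-wise depth being turned into a 3D position that feeds a position embedding) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a standard pinhole camera model, the intrinsics (`fx`, `fy`, `cx`, `cy`) and pixel values are hypothetical, and the sinusoidal encoding is a stand-in chosen for illustration, since the captions do not specify how the embedding is computed.

```python
import math


def backproject_center(u, v, depth, fx, fy, cx, cy):
    """Lift a 2D object center (u, v) with depth to a camera-frame 3D point.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); depth is measured along the optical axis.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def sinusoidal_embedding(point, num_freqs=4):
    """Assumed stand-in encoding: sin/cos of each coordinate at
    geometrically spaced frequencies (1, 2, 4, ...)."""
    emb = []
    for coord in point:
        for k in range(num_freqs):
            freq = 2.0 ** k
            emb.append(math.sin(freq * coord))
            emb.append(math.cos(freq * coord))
    return emb


# Hypothetical values: a predicted 2D center and object-wise depth d = 30 m.
center_3d = backproject_center(u=960.0, v=540.0, depth=30.0,
                               fx=1260.0, fy=1260.0, cx=800.0, cy=450.0)
embedding = sinusoidal_embedding(center_3d)
print(center_3d, len(embedding))
```

In a multi-view setting the camera-frame point would additionally be transformed into a shared frame using each camera's extrinsics before being embedded; that step is omitted here for brevity.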