RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

Xiaomeng Chu; Jiajun Deng; Guoliang You; Yifan Duan; Yao Li; Yanyong Zhang

TL;DR

RayFormer is a camera-ray-inspired, query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. It also introduces a ray sampling method that organizes the distribution of feature sampling points on both the images and the bird's eye view.

Abstract

Recent advances in query-based multi-camera 3D object detection are characterized by initializing object queries in 3D space and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To address this, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map into sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, so that different queries project onto different areas of the image and extract distinct features. In addition, we leverage the instance information of images to supplement the uniformly initialized object queries with additional queries placed along the rays that pass through 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both the images and the bird's eye view. Extensive experiments on the nuScenes dataset validate the proposed ray-inspired model design. RayFormer achieves 55.5% mAP and 63.3% NDS.
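The ambiguity described in the abstract can be illustrated with a minimal pinhole-camera sketch. It is not the authors' implementation: the function names `radial_queries` and `image_column`, the focal length, the principal point, the ray azimuths, and the depths are all illustrative assumptions, and the rays are centered on a single camera so that the overlap is exact (in the paper, points on the same ray project onto nearly vertical lines).

```python
import math

def radial_queries(azimuths_deg, depths):
    """Place one BEV query (x, z) per depth on each ray leaving the camera."""
    rays = {}
    for a in azimuths_deg:
        t = math.radians(a)
        rays[a] = [(r * math.sin(t), r * math.cos(t)) for r in depths]
    return rays

def image_column(x, z, focal=1000.0, cx=800.0):
    """Pinhole projection of a BEV point to a pixel column (camera looks along +z)."""
    return focal * x / z + cx

# Hypothetical layout: 5 rays, 3 sparsely spaced queries per ray.
rays = radial_queries(azimuths_deg=[-20, -10, 0, 10, 20], depths=[10.0, 25.0, 40.0])
columns = {a: [image_column(x, z) for x, z in pts] for a, pts in rays.items()}
for a, cols in columns.items():
    print(a, [round(c, 1) for c in cols])
```

Under these assumptions, every query on one ray maps to the same pixel column regardless of its depth, while queries on different rays land in different columns. This is why placing only a few queries per ray, and spreading the rest across azimuth, reduces the number of queries that sample near-identical image features.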

Paper Structure (14 sections, 5 equations, 6 figures, 7 tables)

Introduction
Related work
Approach
Overall Framework
BEV Feature Generation
Query Initialization
Ray Sampling
2D Guided Foreground Query Supplement
Experiment
Datasets and Metrics
Implementation Details
Main Results
Ablation Studies
Conclusion

Figures (6)

Figure 1: Comparison of query initialization methods: (a) Grid layout results in multiple queries from the instance frustum being projected onto the same object, yielding similar features. (b) Radial initialization mimics optical imaging principles, reducing queries projected onto the same object.
Figure 2: Overall architecture of RayFormer. Upon inputting multiple frames of multi-camera images into the image encoder, we extract multi-scale image features. These features are processed by a 2D detection head and a depth head to obtain 2D bounding boxes (bboxes) and depth distributions $D'$, respectively. The image features and depth distributions are fed into the lift-splat module for forward projection to generate BEV features. We expand the height of the detected 2D bboxes and use them to select foreground rays. On these rays, a specific number of foreground queries are selected. Along with the radially distributed base queries, all queries are fed into the transformer decoder and refined $L$ times. The core module of the decoder, ray sampling, sets the sampling points along the camera ray, extracting both image and BEV features. Finally, queries are decoded by the classification head and the regression head for accurate predictions.
Figure 3: For $N$ queries, we add $K$ equally spaced points and ray-point offsets to create $N \times K$ adaptive ray points, which are then warped to $T$ frames. By combining these adaptive ray points with the $P$ sampling offsets generated for each query in Cartesian coordinates, we obtain $N \times T \times K \times P$ ray sampling points to aggregate image and BEV features.
Figure 4: The projection of points on camera rays and the selection of foreground rays. (a) Points located on the same camera ray (indicated by the same color) project onto nearly vertical lines within the image. (b) The BEV plane is segmented by rays whose count depends on category sizes; rays hitting the expanded areas of category-specific 2D bounding boxes are designated as foreground rays.
Figure 5: Visualization of RayFormer. In the BEV diagram (right), ground truth and predicted outcomes are depicted with green and blue rectangles, respectively. Instances of missed detection boxes are highlighted with red circles.
...and 1 more figure
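The point bookkeeping in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the paper's code: `ray_sampling_points`, the `half_len` segment length, and the use of plain 2D translations as a stand-in for warping to other frames are assumptions, and the offsets that the model would predict are passed in as fixed numbers.

```python
import math

def ray_sampling_points(queries, ray_offsets, frame_shifts, xy_offsets, half_len=2.0):
    """queries: N (radius, azimuth) pairs in BEV polar coordinates.
    ray_offsets: N x K radial offsets (stand-ins for predicted ray-point offsets).
    frame_shifts: T (dx, dy) translations (assumed stand-in for frame warping).
    xy_offsets: N x P Cartesian offsets (stand-ins for predicted sampling offsets).
    Returns a nested list indexed [n][t][k][p] of (x, y) BEV points."""
    out = []
    for n, (radius, azimuth) in enumerate(queries):
        K = len(ray_offsets[n])
        per_frame = []
        for dx, dy in frame_shifts:
            per_ray_point = []
            for k in range(K):
                # K equally spaced points on the ray segment centered at the query,
                # each shifted along the ray by its own offset.
                base = radius - half_len + 2.0 * half_len * k / max(K - 1, 1)
                r = base + ray_offsets[n][k]
                px = r * math.cos(azimuth) + dx
                py = r * math.sin(azimuth) + dy
                # P Cartesian offsets around each adaptive ray point.
                per_ray_point.append([(px + ox, py + oy) for ox, oy in xy_offsets[n]])
            per_frame.append(per_ray_point)
        out.append(per_frame)
    return out

# Hypothetical sizes: N=2 queries, K=3 ray points, T=2 frames, P=4 offsets.
queries = [(10.0, 0.0), (20.0, math.pi / 2)]
points = ray_sampling_points(
    queries,
    ray_offsets=[[0.0] * 3 for _ in queries],
    frame_shifts=[(0.0, 0.0), (1.0, 0.0)],
    xy_offsets=[[(0.0, 0.0)] * 4 for _ in queries],
)
total = sum(len(p) for n in points for t in n for p in t)
print(total)  # N * T * K * P
```

With all offsets set to zero, the ray points stay on the query's ray, which makes the layout easy to inspect; nonzero offsets would let each point move along the ray and each sample move freely in the BEV plane.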
