QE-BEV: Query Evolution for Bird's Eye View Object Detection in Varied Contexts

Jiawei Yao; Yingxin Lai; Hongrui Kou; Tong Wu; Ruixi Liu

TL;DR

The paper addresses BEV-based 3D object detection in dynamic scenes and identifies the limitations of static and simple dynamic queries in exploiting temporal context. It proposes QE-BEV, which combines a Dynamic Query Evolution Module (built on K-means clustering and Top-K Attention) with a Lightweight Temporal Fusion Module and a Diversity Loss, balancing attention while reusing computations. Results on nuScenes and Waymo show state-of-the-art NDS and mAP with improved efficiency: on nuScenes, NDS reaches 56.1 with ResNet50 (57.8 with perspective pretraining) and 61.1 with ResNet101, alongside strong Waymo metrics. The work advances BEV-based detectors toward real-time, long-range temporal reasoning in autonomous driving.

Abstract

3D object detection plays a pivotal role in autonomous driving and robotics, demanding precise interpretation of Bird's Eye View (BEV) images. The dynamic nature of real-world environments necessitates dynamic query mechanisms that adaptively capture and process the complex spatio-temporal relationships present in these scenes. However, prior implementations of dynamic queries have often struggled to leverage these relationships effectively, particularly when integrating temporal information in a computationally efficient manner. To address this limitation, we introduce a framework built on a dynamic query evolution strategy that harnesses K-means clustering and Top-K attention mechanisms for refined spatio-temporal data processing. By dynamically segmenting the BEV space and prioritizing key features through Top-K attention, our model achieves a real-time, focused analysis of pertinent scene elements. Our extensive evaluation on the nuScenes and Waymo datasets shows a marked improvement in detection accuracy, setting a new benchmark in the domain of query-based BEV object detection. Our dynamic query evolution strategy has the potential to push the boundaries of current BEV methods with enhanced adaptability and computational efficiency. Project page: https://github.com/Jiawei-Yao0812/QE-BEV

Paper Structure (13 sections, 14 equations, 6 figures, 6 tables)

This paper contains 13 sections, 14 equations, 6 figures, 6 tables.

Introduction
Related Work
Method
Dynamic Query Evolution Module (DQEM)
Lightweight Temporal Fusion Module
Computational Complexity
Experiment
Implementation Details
Datasets and Evaluation Criteria
Comparison with the State-of-the-art Methods
Ablation Study
Visualization
Conclusion

Figures (6)

Figure 1: Comparison of Different Query Methods. (a) Static query-based method: Queries are pre-defined and unchanging during inference, linking to a consistent set of tokens. (b) Existing dynamic query-based method: Queries are adaptive, updating their association with tokens, yet within a limited, position-guided context. (c) Our approach: Fusion of dynamic queries with both feature aggregation and temporal fusion, enabling queries to adapt more comprehensively by considering previous tokens, thereby capturing intricate object dynamics and relationships over time.
Figure 2: The architecture of QE-BEV. Beginning with feature extraction from surrounding images using a backbone network and FPN, the architecture leverages previous pillars and features for temporal context. These are processed through the Dynamic Query Evolution Module (DQEM) for adaptive query refinement using K-means clustering and Top-K Attention. The Lightweight Temporal Fusion Module (LTFM) then integrates temporal information before the final query update, which combines dynamic temporal aggregation with the initial temporal query initialization. Finally, the updated queries are used for 3D object prediction.
Figure 3: Dynamic Query Evolution Module (DQEM). The sequence begins with the initialization of query pillars, which are then spatially coordinated in the BEV space based on extracted features. Subsequent K-means clustering organizes these features into distinct clusters. The process continues with Top-K Attention Aggregation, dynamically refining each query based on the most informative feature clusters. This results in an evolved query set adept at capturing the complex, multi-dimensional relationships.
Figure 4: Comparative visualization of query results for object detection using different dynamic querying methods. Different instances are distinguished by colors. The size of the points indicates depth: larger points are closer to the camera.
Figure 5: Visualization of 3D object detection results. Detected objects are highlighted with bounding boxes in the camera views and corresponding position markers in the LiDAR top view.
...and 1 more figure
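
The clustering-then-attention sequence described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`kmeans`, `top_k_attention`, `evolve_query`), the scaled dot-product scoring, the residual query update, and all sizes are hypothetical choices made only to show the idea of clustering features and letting each query attend to its K most relevant clusters.

```python
import math
import random


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's K-means over feature vectors; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def top_k_attention(query, centroids, top_k):
    """Softmax attention restricted to the top_k highest-scoring centroids."""
    scale = math.sqrt(len(query))
    # Assumed scoring rule: scaled dot product between query and centroid.
    scores = [sum(q * c for q, c in zip(query, cen)) / scale for cen in centroids]
    keep = sorted(range(len(centroids)), key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in keep)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in keep]
    total = sum(weights)
    return [
        sum(w / total * centroids[i][d] for w, i in zip(weights, keep))
        for d in range(len(query))
    ]


def evolve_query(query, features, num_clusters=4, top_k=2):
    """Cluster the features, then refine the query from its top_k clusters."""
    centroids = kmeans(features, num_clusters)
    context = top_k_attention(query, centroids, top_k)
    # Assumed update rule: add the aggregated context to the query.
    return [q + c for q, c in zip(query, context)]


# Toy usage with random 8-dimensional features (hypothetical sizes).
rng = random.Random(1)
features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(32)]
query = [rng.gauss(0, 1) for _ in range(8)]
evolved = evolve_query(query, features)
```

Restricting the softmax to the top-scoring clusters means each query aggregates from a fixed number of centroids rather than from every feature, which is one plausible reading of how the Top-K step keeps the aggregation focused and cheap.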
