StereoMV2D: A Sparse Temporal Stereo-Enhanced Framework for Robust Multi-View 3D Object Detection

Di Wu, Feng Yang, Wenhui Zhao, Jinwen Yu, Pan Liao, Benlian Xu, Dingwen Zhang

TL;DR

StereoMV2D integrates RoI-level temporal stereo into a sparse query multi-view 3D detector to address depth ambiguity while preserving efficiency. It introduces Motion-Aware Soft Matching and RoI-Level Temporal Stereo Matching to generate depth-aware priors, complemented by a dynamic confidence gate that robustly fuses monocular and stereo cues. The approach yields significant accuracy gains on nuScenes and Argoverse 2 with acceptable runtime, underscoring the value of object-centric temporal geometry in sparse 3D detection. This work advances efficient, depth-aware multi-view perception for autonomous driving and points to future multimodal and multi-frame extensions.

Abstract

Multi-view 3D object detection is a fundamental task in autonomous driving perception, where balancing detection accuracy against computational efficiency remains crucial. Sparse query-based 3D detectors efficiently aggregate object-relevant features from multi-view images through a set of learnable queries, offering a concise, end-to-end detection paradigm. Building on this foundation, MV2D leverages 2D detection results to provide high-quality object priors for query initialization, enabling higher precision and recall. However, the inherent depth ambiguity in single-frame 2D detections still limits the accuracy of 3D query generation. To address this issue, we propose StereoMV2D, a unified framework that integrates temporal stereo modeling into the 2D detection-guided multi-view 3D detector. By exploiting cross-temporal disparities of the same object across adjacent frames, StereoMV2D enhances depth perception and refines the query priors, while performing all computations efficiently within 2D regions of interest (RoIs). Furthermore, a dynamic confidence gating mechanism adaptively evaluates the reliability of temporal stereo cues by learning statistical patterns derived from the inter-frame matching matrix together with appearance consistency, ensuring robust detection when objects newly appear or are occluded. Extensive experiments on the nuScenes and Argoverse 2 datasets demonstrate that StereoMV2D achieves superior detection performance without incurring significant computational overhead. Code will be available at https://github.com/Uddd821/StereoMV2D.
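The confidence gating idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the three hand-picked statistics (peak match probability, row entropy, appearance similarity), and the fixed weights are all assumptions made for illustration, whereas the paper describes a learned gate. The sketch only shows the general pattern of blending a monocular and a stereo depth estimate according to how trustworthy the inter-frame match looks.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def row_entropy(probs):
    # Shannon entropy of one row of the matching matrix (natural log).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gate_confidence(match_row, appearance_sim,
                    weights=(4.0, -2.0, 3.0), bias=-3.0):
    """Hypothetical gate: maps matching statistics to a value in (0, 1).

    match_row      -- soft-matching probabilities of one current RoI
                      against all previous-frame objects
    appearance_sim -- appearance consistency of the best match, in [0, 1]
    weights, bias  -- fixed here for illustration; assumed to be learned
                      parameters in a real model
    """
    w_peak, w_ent, w_app = weights
    peak = max(match_row)          # a sharp peak suggests a reliable match
    ent = row_entropy(match_row)   # high entropy suggests an ambiguous match
    return sigmoid(w_peak * peak + w_ent * ent + w_app * appearance_sim + bias)


def fuse_depth(d_mono, d_stereo, gate):
    # Convex combination: gate near 1 trusts stereo, gate near 0 trusts monocular.
    return gate * d_stereo + (1.0 - gate) * d_mono


# A confident match versus an ambiguous one (e.g. a newly appearing object).
g_confident = gate_confidence([0.97, 0.02, 0.01], appearance_sim=0.9)
g_ambiguous = gate_confidence([0.34, 0.33, 0.33], appearance_sim=0.2)
fused_confident = fuse_depth(20.0, 30.0, g_confident)
fused_ambiguous = fuse_depth(20.0, 30.0, g_ambiguous)
```

With these illustrative weights, the confident match pulls the fused depth toward the stereo estimate, while the ambiguous match falls back to the monocular estimate. This is the behavior the gate is meant to provide for newly appearing or occluded objects.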

Paper Structure

This paper contains 20 sections, 21 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison of different paradigms for query-based multi-view 3D object detection. (a) DETR3D adopts numerous learnable queries without any location priors. (b) MV2D initializes sparse queries using 2D detections, improving localization but still suffering from monocular depth ambiguity. (c) StereoMV2D (ours) integrates RoI-level temporal stereo into a sparse query framework to enhance depth reasoning.
  • Figure 2: Overall architecture of StereoMV2D. Given multi-view images from adjacent timestamps, the model first extracts image features and obtains RoI features corresponding to 2D detections. These RoI features, together with the historical queries, are fed into the sparse temporal stereo query generator. The generator begins by performing motion-aware soft matching to compute a matching matrix between objects across frames. After identifying RoI pairs from adjacent timestamps, an RoI-level temporal stereo matching module constructs cost volumes between all matched RoI pairs to predict object depths. To handle newly appearing objects and occlusion cases, we retain the monocular implicit query generator and integrate the 3D reference point proposals from the monocular and stereo branches using a dynamic confidence gating strategy. The fused depth-aware queries, equipped with strong positional priors, are then refined through interactions with RoI features via a sparse decoder, and finally passed through the detection head to produce 3D predictions.
  • Figure 3: Visualization of query locations generated by different query initialization methods. The first column corresponds to the fixed-number learnable query initialization, the second column shows the single-frame implicit query generator from MV2D, and the third column presents our proposed sparse temporal-stereo-based query generation method.
  • Figure 4: Visualization of 3D detection results. On the left, we compare the BEV predictions of our method with those of a representative dense-BEV 3D detection baseline. Green boxes denote ground truth, and blue boxes denote predictions. The first column shows the baseline results, while the second column presents StereoMV2D’s outputs. Regions highlighted with red rectangles indicate areas with noticeably improved localization or reduced false positives. On the right, we visualize the 3D detection results of StereoMV2D across multiple camera views, where bounding boxes of different colors correspond to different object categories.
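The motion-aware soft matching step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation. It assumes each previous-frame object carries an image-plane center, a pixel velocity, and a feature vector. It scores each pair by cosine similarity minus a scaled distance to the motion-compensated center, and it row-normalizes the scores with a softmax. The names `soft_match`, `temperature`, and `dist_scale` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)


def soft_match(curr, prev, dt, temperature=0.1, dist_scale=50.0):
    """Return a matching matrix with one row per current RoI.

    curr -- list of dicts with 'center' (u, v) and 'feat'
    prev -- list of dicts with 'center', 'velocity' (pixels/s) and 'feat'
    dt   -- time gap between the two frames in seconds
    All field names and the scoring rule are illustrative assumptions.
    """
    if not prev:
        return [[] for _ in curr]            # nothing to match against
    matrix = []
    for c in curr:
        scores = []
        for p in prev:
            # Motion compensation: where the previous object should be now.
            pu = p["center"][0] + p["velocity"][0] * dt
            pv = p["center"][1] + p["velocity"][1] * dt
            dist = math.hypot(c["center"][0] - pu, c["center"][1] - pv)
            score = cosine(c["feat"], p["feat"]) - dist / dist_scale
            scores.append(score / temperature)
        matrix.append(softmax(scores))       # each row sums to 1
    return matrix


prev_objs = [
    {"center": (100.0, 100.0), "velocity": (20.0, 0.0), "feat": [1.0, 0.0]},
    {"center": (300.0, 120.0), "velocity": (-10.0, 0.0), "feat": [0.0, 1.0]},
]
curr_rois = [
    {"center": (110.0, 100.0), "feat": [0.9, 0.1]},
    {"center": (295.0, 120.0), "feat": [0.1, 0.9]},
]
match_matrix = soft_match(curr_rois, prev_objs, dt=0.5)
```

Each row of `match_matrix` is a probability distribution over previous-frame objects. A row with a sharp peak identifies an RoI pair that could be passed on to stereo matching, while a flat row signals an unreliable association.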