STS: Surround-view Temporal Stereo for Multi-view 3D Detection
Zengran Wang, Chen Min, Zheng Ge, Yinhao Li, Zeming Li, Hongyu Yang, Di Huang
TL;DR
Surround-view Temporal Stereo (STS) introduces a cross-camera, time-aware stereo paradigm for multi-view 3D detection. By warping features across cameras and time with differentiable homographies and using Spacing-Increasing Discretization (SID), STS yields more accurate depth predictions and BEV representations when fused with a monocular depth module. Extensive nuScenes experiments show consistent improvements in mAP and NDS, especially for mid- to long-range objects, and ablations confirm the contributions of surround-view matching, SID, and depth fusion. The method advances depth learning for RGB-only multi-view detection and demonstrates practical gains for autonomous driving systems.
Abstract
Learning accurate depth is essential to multi-view 3D object detection. Recent approaches mainly learn depth from monocular images, which faces inherent difficulties due to the ill-posed nature of monocular depth learning. Instead of relying solely on a monocular depth method, in this work we propose a novel Surround-view Temporal Stereo (STS) technique that leverages the geometric correspondence between frames across time to facilitate accurate depth learning. Specifically, we regard the fields of view from all cameras around the ego vehicle as a unified view, namely the surround-view, and conduct temporal stereo matching on it. The resulting geometric correspondence between different frames from STS is combined with the monocular depth to yield the final depth prediction. Comprehensive experiments on nuScenes show that STS greatly boosts 3D detection ability, notably for medium- and long-distance objects. On BEVDepth with a ResNet-50 backbone, STS improves mAP and NDS by 2.6% and 1.4%, respectively. Consistent improvements are observed when using a larger backbone and a larger image resolution, demonstrating its effectiveness.
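
The Spacing-Increasing Discretization (SID) named in the TL;DR can be illustrated with a minimal sketch. It uses the standard SID formulation, in which bin edges are spaced uniformly in log-depth so that bins grow wider with distance. The function name `sid_bin_edges`, the depth range of 2 m to 58 m, and the bin count are illustrative assumptions, not values taken from this summary, and the code is not the authors' implementation.

```python
import math
from typing import List


def sid_bin_edges(d_min: float, d_max: float, num_bins: int) -> List[float]:
    """Return num_bins + 1 depth edges spaced uniformly in log-depth.

    Edge i is d_min * (d_max / d_min) ** (i / num_bins), so consecutive
    edges share a constant ratio and bin widths increase with depth.
    """
    if not (0.0 < d_min < d_max):
        raise ValueError("require 0 < d_min < d_max")
    if num_bins < 1:
        raise ValueError("require at least one bin")
    log_ratio = math.log(d_max / d_min)
    return [d_min * math.exp(log_ratio * i / num_bins) for i in range(num_bins + 1)]


# Hypothetical range and bin count, chosen only for illustration.
edges = sid_bin_edges(2.0, 58.0, 8)
widths = [far - near for near, far in zip(edges, edges[1:])]
```

With these assumed settings, each bin is wider than the one before it, so depth hypotheses are denser near the ego vehicle and sparser far away, where a fixed change in depth produces a smaller shift in the image.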
