STS: Surround-view Temporal Stereo for Multi-view 3D Detection

Zengran Wang, Chen Min, Zheng Ge, Yinhao Li, Zeming Li, Hongyu Yang, Di Huang

TL;DR

Surround-view Temporal Stereo (STS) introduces a cross-camera, time-aware stereo paradigm for multi-view 3D detection. By warping features across cameras and time with differentiable homographies and using Spacing-Increasing Discretization (SID), STS yields more accurate depth predictions and BEV representations when fused with a monocular depth module. Extensive nuScenes experiments show consistent improvements in mAP and NDS, especially for mid- to long-range objects, and ablations confirm the contributions of surround-view matching, SID, and depth fusion. The method advances depth learning for RGB-only multi-view detection and demonstrates practical gains for autonomous driving systems.

Abstract

Learning accurate depth is essential to multi-view 3D object detection. Recent approaches mainly learn depth from monocular images, which is inherently difficult because monocular depth learning is ill-posed. Instead of relying solely on a monocular depth method, in this work we propose a novel Surround-view Temporal Stereo (STS) technique that leverages the geometric correspondence between frames across time to facilitate accurate depth learning. Specifically, we regard the fields of view from all cameras around the ego vehicle as a unified view, namely the surround-view, and conduct temporal stereo matching on it. The resulting geometric correspondence between different frames from STS is combined with the monocular depth to yield the final depth prediction. Comprehensive experiments on nuScenes show that STS greatly boosts 3D detection ability, notably for medium- and long-distance objects. On BEVDepth with a ResNet-50 backbone, STS improves mAP and NDS by 2.6% and 1.4%, respectively. Consistent improvements are observed when using a larger backbone and a larger image resolution, demonstrating its effectiveness.

Paper Structure

This paper contains 27 sections, 5 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison of different settings for stereo-based depth estimation from 2D RGB images to learn the absolute scale of the world.
  • Figure 2: Visualization of projected sampling points on source images using different depth sampling strategies. UD represents traditional uniform depth sampling, SID represents Spacing-Increasing Discretization.
  • Figure 3: The flowchart of our method. The image features are first lifted into a frustum of features for each camera with the depth fused from monocular depth module and STS. Then all frustums are splatted into a unified Bird’s-Eye-View representation using a pooling operation. The detection head is used to get the final detection results.
  • Figure 4: The detailed architecture of our proposed Surround-view Temporal Stereo (STS).
  • Figure 5: Visualization of regions that STS has positive effects on. Dashed boxes highlight key regions for analysis.
  • ...and 1 more figures
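
As a rough illustration of the two depth sampling strategies contrasted in Figure 2, the sketch below computes bin edges for uniform depth sampling (UD) and for Spacing-Increasing Discretization (SID). It uses the common SID formulation in which edges are equally spaced in log-depth. The function names, the depth range of 2 m to 58 m, and the bin count of 8 are assumptions chosen for illustration; this section does not state the values or the exact variant used in STS.

```python
import math


def uniform_bins(d_min, d_max, num_bins):
    """UD: bin edges equally spaced in depth."""
    step = (d_max - d_min) / num_bins
    return [d_min + step * i for i in range(num_bins + 1)]


def sid_bins(d_min, d_max, num_bins):
    """SID: bin edges equally spaced in log-depth, so bins widen with distance."""
    if d_min <= 0 or d_max <= d_min:
        raise ValueError("require 0 < d_min < d_max")
    log_ratio = math.log(d_max / d_min)
    return [d_min * math.exp(log_ratio * i / num_bins) for i in range(num_bins + 1)]


# Hypothetical depth range and bin count, for illustration only.
ud = uniform_bins(2.0, 58.0, 8)
sid = sid_bins(2.0, 58.0, 8)

for name, edges in (("UD", ud), ("SID", sid)):
    print(name, [round(e, 2) for e in edges])
```

Both strategies cover the same range with the same number of bins. UD spends them evenly, while SID places narrow bins close to the camera and progressively wider bins farther away.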