SGDE: Stereo Guided Depth Estimation for 360$^\circ$ Camera Sets

Jialei Xu, Wei Yin, Dong Gong, Junjun Jiang, Xianming Liu

TL;DR

The authors address the challenge of depth estimation for 360° camera rigs with limited overlap by introducing SGDE, a stereo-guided pipeline that explicitly leverages depth priors from overlapping views. They unify fisheye and pinhole cameras through virtual pinhole transformations, and stabilize pose estimates with a geometry loop constraint to enable robust stereo rectification. The depth prior $D_p$ is used both as an input feature and as a supervision signal via $L_{dp}$ (with $\lambda=0.005$), improving both supervised and self-supervised depth estimation. Experiments on Synthetic Urban, DDAD, and nuScenes show consistent improvements in depth accuracy and cross-view consistency, and the approach also yields tangible benefits for downstream tasks like 3D object detection and occupancy prediction.
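As a rough illustration of how a depth prior can act as a supervision signal, the sketch below computes a masked L1 term between a predicted depth map and the prior, then adds it to a base loss with the weight $\lambda=0.005$ quoted above. The L1 form follows the "L1 supervision" wording in the Figure 5 caption. The function names, the use of `None` to mark pixels outside the overlap, and the additive combination with the base loss are assumptions made for this example, not the authors' implementation.

```python
from typing import Optional, Sequence

LAMBDA_DP = 0.005  # weight quoted in the TL;DR


def depth_prior_loss(pred: Sequence[float],
                     prior: Sequence[Optional[float]]) -> float:
    """Mean absolute error over pixels where the stereo prior is valid.

    `prior` holds None for pixels outside the overlap region
    (hypothetical convention chosen for this sketch).
    """
    diffs = [abs(p - q) for p, q in zip(pred, prior) if q is not None]
    if not diffs:
        return 0.0  # no overlap pixels: the prior contributes nothing
    return sum(diffs) / len(diffs)


def total_loss(base_loss: float,
               pred: Sequence[float],
               prior: Sequence[Optional[float]],
               weight: float = LAMBDA_DP) -> float:
    """Base depth loss plus the weighted depth-prior term (assumed additive)."""
    return base_loss + weight * depth_prior_loss(pred, prior)


# Flattened toy "image" of three pixels; the middle one lies outside the overlap.
pred = [10.0, 20.0, 30.0]
prior = [12.0, None, 27.0]
print(depth_prior_loss(pred, prior))
print(total_loss(1.0, pred, prior))
```

Masking matters here because the prior exists only in the overlap between adjacent cameras; pixels without a prior should not influence the term at all.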

Abstract

Depth estimation is a critical technology in autonomous driving, and multi-camera systems are often used to achieve 360$^\circ$ perception. These 360$^\circ$ camera sets often have limited or low-quality overlap regions, making multi-view stereo methods infeasible for the entire image. Alternatively, monocular methods may not produce consistent cross-view predictions. To address these issues, we propose the Stereo Guided Depth Estimation (SGDE) method, which enhances depth estimation of the full image by explicitly utilizing multi-view stereo results on the overlap. We suggest building virtual pinhole cameras to resolve the distortion problem of fisheye cameras and unify the processing for the two types of 360$^\circ$ cameras. To handle the varying noise on camera poses caused by unstable movement, we employ a self-calibration method to obtain highly accurate relative poses of adjacent cameras with minor overlap. These steps enable the use of robust stereo methods to obtain a high-quality depth prior in the overlap region. This prior serves not only as an additional input but also as pseudo-labels that enhance the accuracy of depth estimation methods and improve cross-view prediction consistency. The effectiveness of SGDE is evaluated on one fisheye camera dataset, Synthetic Urban, and two pinhole camera datasets, DDAD and nuScenes. Our experiments demonstrate that SGDE is effective for both supervised and self-supervised depth estimation, and highlight the potential of our method for advancing downstream autonomous driving technologies, such as 3D object detection and occupancy prediction.
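Once two adjacent views are rectified, a stereo matcher returns a disparity per pixel, and the standard pinhole relation $Z = f B / d$ turns it into metric depth. The sketch below shows that conversion only. The function name, the parameter names, and the `min_disp` validity threshold are assumptions for illustration and do not come from the paper.

```python
from typing import Optional


def disparity_to_depth(disparity_px: float,
                       focal_px: float,
                       baseline_m: float,
                       min_disp: float = 1e-6) -> Optional[float]:
    """Convert a rectified-stereo disparity (pixels) to depth (meters).

    Uses Z = f * B / d. Returns None when the disparity is too small
    to give a meaningful depth (hypothetical validity rule).
    """
    if disparity_px <= min_disp:
        return None
    return focal_px * baseline_m / disparity_px


# Example: 1000 px focal length, 0.5 m baseline, 20 px disparity.
print(disparity_to_depth(20.0, 1000.0, 0.5))
```

Because depth is inversely proportional to disparity, small errors in rectification translate into large depth errors at long range, which is why accurate relative poses matter before any matching is attempted.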

Paper Structure (21 sections, 5 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Multi-view point clouds from predicted depth. SGDE (ours) produces much more consistent point clouds across views. View 3 shows the reconstruction of the ground: our results from different views lie on the same plane, whereas those of the SOTA method (SurroundDepth [wei2022surrounddepth]) do not.
  • Figure 2: Overlaid stereo-rectified images computed from the relative pose between two adjacent cameras in the DDAD dataset [guizilini20203d]. The images rectified with the "ground-truth" pose provided by the dataset show a large error in the vertical direction, which indicates that the dataset-provided parameters are inaccurate.
  • Figure 3: The pipeline of Stereo Guided Depth Estimation (SGDE). For the 360$^\circ$ camera set, we first apply the geometry loop constraint to optimize the surrounding camera poses, which ensures the effectiveness of the subsequent geometry scheme. Then, we use mature stereo rectification and matching algorithms to obtain the depth prior in the overlapping regions. Finally, the depth prior is used as both an extra input and a supervision signal to enhance the depth estimation network.
  • Figure 4: Depth prior in the Synthetic Urban dataset [won2019sweepnet], computed by RAFT-Stereo [lipson2021raft].
  • Figure 5: Visualization of self-supervised depth prediction results on 6 cameras at the same frame in DDAD [guizilini20203d]. Top row (RGB): RGB images from the six cameras; the red box marks the overlapping region of each camera. Second row: the depth prior of each camera. Third row: results of the baseline R18 [godard2019digging]. Fourth row: results of SurroundDepth [wei2022surrounddepth]. Last row (R18 (i & L1)): results of the baseline R18 [godard2019digging] with the depth prior applied as input and L1 supervision.
  • ...and 2 more figures
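
The geometry loop constraint named in the Figure 3 caption can be read intuitively as follows: in a closed ring of cameras, composing the relative poses from each camera to its neighbor all the way around should return to the starting pose. The sketch below measures how far a set of relative rotations is from closing that loop. It is a simplified, rotation-only illustration under that reading; the helper names and the yaw-only toy rig are assumptions, and the paper's actual formulation may differ.

```python
import math
from typing import List

Matrix = List[List[float]]


def rot_z(deg: float) -> Matrix:
    """Rotation about the vertical axis by `deg` degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain 3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def loop_residual_deg(relative_rotations: List[Matrix]) -> float:
    """Angle (degrees) by which the composed loop deviates from identity."""
    total: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for r in relative_rotations:
        total = matmul(total, r)
    trace = total[0][0] + total[1][1] + total[2][2]
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


# Toy rig: six cameras spaced 60 degrees apart close the loop exactly.
ideal = [rot_z(60.0) for _ in range(6)]
# One relative pose is off by one degree, so the loop no longer closes.
noisy = [rot_z(60.0) for _ in range(5)] + [rot_z(61.0)]
print(loop_residual_deg(ideal))
print(loop_residual_deg(noisy))
```

A pose optimizer could treat such a residual as one penalty term: any nonzero value signals that at least one relative pose in the ring is inconsistent with the others.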