CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Samer Abualhanud, Christian Grannemann, Max Mehltretter

TL;DR

CylinderDepth addresses the challenge of depth inconsistency across overlapping views in self-supervised surround-view depth estimation. It introduces a two-pass network that first predicts per-image depth and then projects the 3D points onto a shared unit cylinder, enabling an explicit, non-learned spatial attention based on cylindrical distances to enforce cross-view consistency. The method leverages spatial, temporal, and spatio-temporal photometric supervision, along with a depth-consistency metric, and demonstrates superior multi-view consistency and depth accuracy on DDAD and nuScenes with a smaller memory footprint than 3D-attention approaches. This approach yields more coherent 360° scene representations, with potential impact on robust 3D perception for autonomous driving and robotics.

Abstract

Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image, and the resulting 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves both the consistency of depth estimates across images and the overall depth accuracy compared to state-of-the-art methods.
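
The pipeline described in the abstract (lift depth to 3D, project onto a shared unit cylinder, aggregate features by cylindrical distance) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the choice of the y axis as the cylinder axis through the rig origin, the Gaussian distance kernel, and the bandwidth `sigma` are all assumptions made for illustration; the abstract only states that the attention is explicit, non-learned, and based on distances on the cylinder.

```python
import numpy as np

def backproject(depth, K, cam_to_rig):
    """Lift a depth map (H, W) to 3D points (H*W, 3) in the shared rig frame.

    K is the 3x3 intrinsic matrix, cam_to_rig a 4x4 camera-to-rig transform.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T            # camera-frame rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((h * w, 1))], axis=1)
    return (pts_h @ cam_to_rig.T)[:, :3]

def project_to_unit_cylinder(pts):
    """Map rig-frame points to (azimuth, height) on a unit cylinder.

    Assumption: the cylinder axis is the y axis through the rig origin.
    """
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    radius = np.maximum(np.hypot(x, z), 1e-8)  # distance to the cylinder axis
    azimuth = np.arctan2(x, z)                 # angle around the axis
    height = y / radius                        # height where the ray meets r = 1
    return np.stack([azimuth, height], axis=1)

def cylindrical_attention(pos, feats, sigma=0.1):
    """Aggregate features with fixed weights from distances on the cylinder.

    pos: (N, 2) cylinder positions of all pixels from all images.
    feats: (N, C) features. The Gaussian kernel is an assumed choice.
    """
    d_az = pos[:, None, 0] - pos[None, :, 0]
    d_az = np.arctan2(np.sin(d_az), np.cos(d_az))  # wrap angle to [-pi, pi]
    d_h = pos[:, None, 1] - pos[None, :, 1]
    # On a unit cylinder the arc length equals the angle difference.
    logits = -(d_az ** 2 + d_h ** 2) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ feats, weights
```

To use it, one would concatenate the cylinder positions and features of all cameras into single `pos` and `feats` arrays before calling `cylindrical_attention`, so that pixels from different images that land close together on the cylinder exchange information. The dense N×N weight matrix is only practical at low resolution, which fits the paper's note that attention is applied to the lowest-scale features.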

Paper Structure

This paper contains 24 sections, 9 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparison of multi-view consistency between our method and CVCDepth [ding2024towards]. The star and circle denote 3D reconstructions of the same 3D object point from two different images. While prior work struggles to achieve consistency in the reconstruction across images, our method overcomes this limitation.
  • Figure 2: Overview of the proposed network. The depth network takes the target images $\mathbf{I}_{t}$ as input. The lowest-scale features $\mathbf{F}_{S, \mathbf{I}_t}$ from all target images are projected onto a cylinder, where attention is applied based on cylindrical distances. The pose network takes the source $\mathbf{I}_{t', 1}$ and target front $\mathbf{I}_{t, 1}$ images as input to predict the temporal pose.
  • Figure 3: Visualization of the cylindrical projection of a pixel $p$ from the 3D position map $\mathbf{P}_{S, \mathbf{I}_{t, i}}$, resulting in the cylindrical position map $\mathbf{O}_{S, \mathbf{I}_{t, i}}$ for all pixels in $\mathbf{P}_{S, \mathbf{I}_{t, i}}$.
  • Figure 4: Panoramic visualization of the cylindrical projection of RGB inputs. Note that in our method, only pixel positions are projected, not RGB values. This figure is provided solely for illustration, to show how objects captured from different views are mapped to nearby locations in cylindrical coordinates.
  • Figure 5: Attention maps for a query token (indicated by the arrow in the back-left image), as overlays on the respective RGB images, showing that this token attends to itself, nearby regions, and to the corresponding region in the spatially adjacent image. High attention is shown in red, low attention in yellow to blue.
  • ...and 2 more figures