CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Samer Abualhanud; Christian Grannemann; Max Mehltretter

TL;DR

CylinderDepth addresses the challenge of depth inconsistency across overlapping views in self-supervised surround-view depth estimation. It introduces a two-pass network that first predicts per-image depth and then projects the 3D points onto a shared unit cylinder, enabling an explicit, non-learned spatial attention based on cylindrical distances to enforce cross-view consistency. The method leverages spatial, temporal, and spatio-temporal photometric supervision, along with a depth-consistency metric, and demonstrates superior multi-view consistency and depth accuracy on DDAD and nuScenes with a smaller memory footprint than 3D-attention approaches. This approach yields more coherent 360° scene representations, with potential impact on robust 3D perception for autonomous driving and robotics.

Abstract

Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet most existing methods produce depth estimates that are inconsistent between overlapping images. To address this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, an initial depth map is predicted per image, and the resulting 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, in which each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves both the consistency of depth estimates across images and the overall depth accuracy compared to state-of-the-art methods.
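
The pipeline described in the abstract (lift per-image depth to 3D, project onto a shared unit cylinder, then aggregate features by distance on the cylinder) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the choice of the rig z-axis as the cylinder axis, the Gaussian weighting, and the `sigma` bandwidth are all assumptions made for illustration; the text above only states that the attention is explicit, non-learned, and based on cylindrical distances.

```python
import numpy as np

def backproject(depth, K, cam_to_rig):
    """Lift a depth map (H, W) to 3D points (H*W, 3) in a shared rig frame.
    K is the 3x3 intrinsic matrix, cam_to_rig a 4x4 pose (hypothetical names)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T            # camera-frame rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    R, t = cam_to_rig[:3, :3], cam_to_rig[:3, 3]
    return pts_cam @ R.T + t

def to_unit_cylinder(pts):
    """Project rig-frame points onto a unit cylinder around the z-axis.
    Assumption: the cylinder axis is the rig z-axis through the rig origin.
    Returns (N, 2) positions: azimuth in [-pi, pi] and height on the cylinder."""
    radius = np.linalg.norm(pts[:, :2], axis=1)
    theta = np.arctan2(pts[:, 1], pts[:, 0])
    height = pts[:, 2] / np.maximum(radius, 1e-8)  # scale the point to radius 1
    return np.stack([theta, height], axis=1)

def cylinder_attention(feats, pos, sigma=0.05):
    """Non-learned attention over pixels pooled from all images.
    Weights depend only on distance on the cylinder surface.
    Assumption: Gaussian kernel with bandwidth sigma, normalized per row."""
    dtheta = pos[:, None, 0] - pos[None, :, 0]
    dtheta = (dtheta + np.pi) % (2 * np.pi) - np.pi  # wrap across the +/-pi seam
    dh = pos[:, None, 1] - pos[None, :, 1]
    d2 = dtheta ** 2 + dh ** 2        # on a unit cylinder, arc length equals angle
    weights = np.exp(-d2 / (2 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats
```

In use, the points from every camera would be concatenated after `backproject`, mapped with `to_unit_cylinder`, and the resulting positions passed to `cylinder_attention` together with the per-pixel features. The dense N×N weight matrix is only practical for small inputs; it is kept here for clarity.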
