Towards Scale-Aware Full Surround Monodepth with Transformers

Yuchen Yang, Xinyi Wang, Dong Li, Lu Tian, Ashish Sirasao, Xun Yang

TL;DR

The paper tackles scale-aware depth estimation in full surround monodepth by introducing SA-FSM, a transformer-based framework that fuses cross-view context with a novel Neighbor-enhanced Cross-View Attention (NCA). It further strengthens scale-awareness through a two-round progressive SfM training pipeline, using strong initial correspondences from SuperGlue and subsequent filtering to refine learning. SA-FSM achieves state-of-the-art results on the DDAD benchmark and outperforms prior FSM methods on nuScenes while obviating the need for median-scaling at test time. The combination of improved cross-view feature aggregation and a robust, staged SfM supervision offers practical, fast, and accurate 360-degree depth perception for autonomous systems.

Abstract

Full surround monodepth (FSM) methods can learn from multiple camera views simultaneously in a self-supervised manner to predict scale-aware depth, which is more practical for real-world applications than the scale-ambiguous depth obtained from a standalone monocular camera. In this work, we focus on enhancing the scale-awareness of FSM methods for depth estimation. To this end, we propose to improve FSM from two perspectives: depth network structure optimization and training pipeline optimization. First, we construct a transformer-based depth network with neighbor-enhanced cross-view attention (NCA). The cross-attention modules can better aggregate the cross-view context in both global and neighboring views. Second, we formulate a transformer-based feature matching scheme with progressive training to improve the structure-from-motion (SfM) pipeline. This allows us to learn scale-awareness with sufficient matches and further facilitates network convergence by removing mismatches based on the SfM loss. Experiments demonstrate that the resulting scale-aware full surround monodepth (SA-FSM) method substantially improves scale-aware depth predictions without median-scaling at test time, and performs favorably against state-of-the-art FSM methods, e.g., surpassing SurroundDepth by 3.8% in terms of accuracy at $\delta < 1.25$ on the DDAD benchmark.
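To make the evaluation terms in the abstract concrete, the following is a minimal sketch of how the threshold accuracy at delta < 1.25 and the absolute relative error are commonly computed, with median-scaling as an optional step. The function name `depth_metrics`, its parameters, and the `min_depth` cutoff are illustrative assumptions and are not taken from the paper; the paper's own evaluation code may differ in masking and depth ranges.

```python
import numpy as np

def depth_metrics(pred, gt, median_scaling=False, min_depth=1e-3):
    """Common depth metrics (hypothetical helper, not the paper's code).

    pred, gt: arrays of predicted and ground-truth depth of the same shape.
    median_scaling: if True, rescale pred by median(gt) / median(pred),
        which hides any global scale error in the prediction.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Evaluate only pixels that have a valid ground-truth depth.
    valid = gt > min_depth
    pred = np.maximum(pred[valid], min_depth)
    gt = gt[valid]

    if median_scaling:
        pred = pred * (np.median(gt) / np.median(pred))

    # delta accuracy: fraction of pixels with max(pred/gt, gt/pred) < 1.25
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "delta_1.25": float(np.mean(ratio < 1.25)),
    }

# A prediction with the right structure but the wrong global scale.
gt = np.array([10.0, 20.0, 30.0, 40.0])
pred = 2.0 * gt
print(depth_metrics(pred, gt))                       # scale error is penalized
print(depth_metrics(pred, gt, median_scaling=True))  # scale error is hidden
```

A scale-aware method is evaluated with `median_scaling=False`, so a prediction that is off by a global factor scores poorly even when its relative structure is correct.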

Paper Structure

This paper contains 25 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of our proposed FSM method. Images across camera views are concatenated and fed to a transformer-based depth network. Three losses are used for training this FSM depth network: (1) $L_{sfm}$ is calculated using pseudo ground truth generated by the SfM pipeline. (2) $L_{photo}$ is calculated using spatial/temporal frames transformed by predicted pose changes $X_{t+\Delta t}$ and extrinsic matrices. (3) $L_{smooth}$ is calculated on individual depth maps to ensure edge-aware smoothness, as commonly used in self-supervised monodepth methods [godard2017unsupervised, godard2019digging].
  • Figure 2: Illustration of NCA. Tensor shape (B, N, C, H, W) follows the actual shape of the feature extracted from the encoder.
  • Figure 3: Visualization results. First row: RGB images from six views in DDAD. Second row: Images warped from adjacent views. Third row: $Abs\_rel$ error map predicted by the previous SOTA SurroundDepth [wei2022surrounddepth]. Fourth row: $Abs\_rel$ error map predicted by our method; colors closer to the background blue indicate better accuracy at ground-truth pixel locations. Fifth row: Depth predicted by SurroundDepth. Sixth row: Depth predicted by our model. Areas that show depth improvements are highlighted with yellow rectangles.
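
The Figure 1 caption refers to an edge-aware smoothness term $L_{smooth}$ as used in earlier self-supervised monodepth work. The sketch below shows one widely used formulation of that term (mean-normalized disparity gradients down-weighted by image gradients). It is an assumption that SA-FSM uses this exact form; the function name `edge_aware_smoothness` and the input conventions are illustrative only.

```python
import numpy as np

def edge_aware_smoothness(disp, image):
    """Edge-aware smoothness in its commonly used form (illustrative sketch).

    disp:  (H, W) disparity or inverse-depth map.
    image: (H, W, 3) RGB image with values in [0, 1].
    """
    disp = np.asarray(disp, dtype=np.float64)
    image = np.asarray(image, dtype=np.float64)

    # Normalize by the mean so the penalty does not favor shrinking disparity.
    norm_disp = disp / (disp.mean() + 1e-7)

    # Absolute first-order differences of disparity along x and y.
    grad_disp_x = np.abs(norm_disp[:, 1:] - norm_disp[:, :-1])
    grad_disp_y = np.abs(norm_disp[1:, :] - norm_disp[:-1, :])

    # Image gradients, averaged over the colour channels.
    grad_img_x = np.mean(np.abs(image[:, 1:] - image[:, :-1]), axis=-1)
    grad_img_y = np.mean(np.abs(image[1:, :] - image[:-1, :]), axis=-1)

    # Depth changes are penalized less where the image itself has an edge.
    loss_x = (grad_disp_x * np.exp(-grad_img_x)).mean()
    loss_y = (grad_disp_y * np.exp(-grad_img_y)).mean()
    return float(loss_x + loss_y)

# A depth step costs less when it coincides with an image edge.
disp = np.ones((4, 4))
disp[:, 2:] = 2.0
flat_image = np.zeros((4, 4, 3))
edge_image = np.zeros((4, 4, 3))
edge_image[:, 2:] = 1.0
print(edge_aware_smoothness(disp, flat_image))
print(edge_aware_smoothness(disp, edge_image))
```

In a full-surround setting this term would be applied to each camera's depth map independently, which matches the caption's statement that it is calculated on individual depth maps.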