Adjacent-view Transformers for Supervised Surround-view Depth Estimation

Xianda Guo, Wenjie Yuan, Yunpeng Zhang, Tian Yang, Chenming Zhang, Zheng Zhu, Qin Zou, Long Chen

TL;DR

This work addresses the challenge of supervised 360° surround-view depth estimation by leveraging cross-view correlations across multiple cameras. It introduces Adjacent-View Transformers for Supervised Surround-view Depth estimation (AVT-SSDepth), a three-part pipeline consisting of a CNN-Transformer based Feature Extractor, an Adjacent-view Attention module that alternates self-attention within a view and cross-view attention across neighboring cameras, and a Depth Head that predicts all $N$ surround-view depth maps jointly. The model uses LiDAR supervision with a depth loss and an edge-aware smoothness term, and employs an MPViT-small encoder with $Z=8$ adjacent-view layers. Across nuScenes and DDAD, AVT-SSDepth achieves state-of-the-art results and demonstrates strong cross-dataset generalization, offering a practical baseline for reliable surround-view depth in robotics and autonomous driving.
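The edge-aware smoothness term mentioned above is not spelled out in this summary. As a rough illustration, the sketch below uses the form common in the depth-estimation literature: depth gradients are penalized, and the penalty is down-weighted where the image itself has strong gradients. The function name `edge_aware_smoothness`, the mean normalization of the depth map, and the exact weighting are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """Hypothetical edge-aware smoothness penalty.

    depth: (H, W) positive depth map.
    image: (H, W, 3) RGB image with values in [0, 1].
    """
    # Mean-normalize the depth so the penalty does not depend on scale
    # (an assumption borrowed from common practice).
    d = depth / (depth.mean() + 1e-7)

    # Absolute first-order differences of the depth map.
    dd_x = np.abs(d[:, 1:] - d[:, :-1])
    dd_y = np.abs(d[1:, :] - d[:-1, :])

    # Absolute image gradients, averaged over the colour channels.
    di_x = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    di_y = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)

    # Depth changes are cheap where the image has an edge, costly elsewhere.
    return (dd_x * np.exp(-di_x)).mean() + (dd_y * np.exp(-di_y)).mean()
```

With this form, a depth discontinuity that coincides with an image edge costs less than the same discontinuity in a textureless region, which is the behaviour the term is meant to encourage.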

Abstract

Depth estimation has been widely studied and serves as a fundamental step of 3D perception for robotics and autonomous driving. Though significant progress has been made in monocular depth estimation in the past decades, these attempts are mainly conducted on the KITTI benchmark with only front-view cameras, which ignores the correlations across surround-view cameras. In this paper, we propose an Adjacent-View Transformer for Supervised Surround-view Depth estimation (AVT-SSDepth) to jointly predict the depth maps across multiple surrounding cameras. Specifically, we employ a global-to-local feature extraction module that combines CNN with transformer layers for enriched representations. Further, the adjacent-view attention mechanism is proposed to enable intra-view and inter-view feature propagation. The former is achieved by the self-attention module within each view, while the latter is realized by the adjacent attention module, which computes attention across multiple cameras to exchange multi-scale representations across surround-view feature maps. In addition, AVT-SSDepth has strong cross-dataset generalization. Extensive experiments show that our method achieves superior performance over existing state-of-the-art methods on both the DDAD and nuScenes datasets. Code is available at https://github.com/XiandaGuo/SSDepth.
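The alternation between intra-view self-attention and inter-view adjacent attention can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released implementation: it uses single-head attention with no learned projections, normalization, or multi-scale features, and it assumes the cameras are ordered in a ring so that each view's neighbours are the previous and next view. The names `attention` and `adjacent_view_attention` are hypothetical; the default of 8 layers follows the $Z=8$ setting mentioned in the TL;DR.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_tokens, kv_tokens):
    """Single-head scaled dot-product attention without learned weights."""
    scale = 1.0 / np.sqrt(q_tokens.shape[-1])
    weights = softmax(q_tokens @ np.swapaxes(kv_tokens, -1, -2) * scale)
    return weights @ kv_tokens

def adjacent_view_attention(feats, num_layers=8):
    """feats: (N, T, C) array of T tokens with C channels for N ring-ordered views."""
    x = feats
    for _ in range(num_layers):
        # Intra-view step: each view attends to its own tokens.
        x = x + attention(x, x)
        # Inter-view step: each view attends to its two neighbouring views.
        left = np.roll(x, 1, axis=0)    # view i sees view i - 1
        right = np.roll(x, -1, axis=0)  # view i sees view i + 1
        neighbours = np.concatenate([left, right], axis=1)  # (N, 2T, C)
        x = x + attention(x, neighbours)
    return x
```

In this sketch a single layer only moves information between directly adjacent views, so stacking several layers is what lets features travel further around the ring.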

Paper Structure

This paper contains 15 sections, 10 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Paradigm comparison of Supervised Monocular Depth, Self-supervised Monocular Depth, Stereo Matching, Multi-View Stereo (MVS), Self-supervised Surround-View Depth and Supervised Surround-View Depth.
  • Figure 2: The architecture of our proposed AVT-SSDepth, which consists of three parts: Feature Extraction, Adjacent-view Attention, and Depth Head. For feature extraction, both transformer and convolutional layers are adopted to enhance feature modeling and depth inference. For Adjacent-view Attention, alternating self-attention and adjacent-attention layers are applied $Z$ times to fully exchange information between the current view and its adjacent views. Finally, the Depth Head predicts the depth maps of all views simultaneously.
  • Figure 3: Qualitative results on the DDAD [3Dpacknet] dataset. The top two rows show the input images and the ground-truth (GT) depth maps, followed by the predicted depth maps. The last row shows the RGB point-cloud visualization. Surr$^\dagger$ refers to SurroundDepth [wei2022surround] supervised with LiDAR. PF denotes PixelFormer [PixelFormer].
  • Figure 4: Plot of Abs Rel on the nuScenes [caesar2020nuscenes] val set. Adj. stands for Adjacent-view Attention.