Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

Guodong Sun, Junjie Liu, Mingxuan Liu, Moyun Liu, Yang Zhang

TL;DR

The paper tackles self-supervised monocular depth estimation, where models that rely on a single type of prior represent complex scenes poorly. It introduces a multi-prior framework that combines spatial priors from a hybrid transformer encoder, context priors from a novel context prior attention (CPA) module, and semantic priors from a semantic boundary loss and semantic prior attention (SPA). The authors report state-of-the-art performance on KITTI, Make3D, and NYU Depth V2, with strong generalization and favorable computational efficiency. The work highlights the value of integrating multiple priors for depth perception in diverse environments and provides a foundation for further multi-prior fusion and efficient training strategies.

Abstract

Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data. However, the lack of labeled information poses a significant challenge to the model's representation, limiting its ability to capture the intricate details of the scene accurately. Prior information can potentially mitigate this issue, enhancing the model's understanding of scene structure and texture. Nevertheless, relying on a single type of prior information often falls short when dealing with complex scenes, which limits generalization performance. To address these challenges, we introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions. Specifically, we employ a hybrid transformer and a lightweight pose network to obtain long-range spatial priors in the spatial dimension. Then, the context prior attention is designed to improve generalization, particularly in complex structures or untextured areas. In addition, semantic priors are introduced through a semantic boundary loss, and semantic prior attention is added to further refine the semantic features extracted by the decoder. Experiments on three diverse datasets demonstrate the effectiveness of the proposed model. It integrates multiple priors to comprehensively enhance the representation ability, improving the accuracy and reliability of depth estimation. Code is available at: https://github.com/MVME-HBUT/MPRLNet

Paper Structure (28 sections, 12 equations, 12 figures, 9 tables, 1 algorithm)

Figures (12)

  • Figure 1: Comparison of depth maps. Benefiting from learning multiple priors, the proposed network captures local details and correct contours. In terms of representation, our method is superior to FSRE-Depth and SwinDepth. For example, it estimates clearer contours of cyclists and billboards.
  • Figure 2: Overview of the proposed model. The model employs a spatially augmented depth encoder and a lightweight PoseNet to extract spatial priors. The context prior layer in the decoder captures the context priors. Meanwhile, semantic pseudo-labels are introduced into the semantic boundary loss, supplemented by the semantic prior layer to provide semantic prior guidance. The depth decoder consists of a context prior layer, three semantic prior layers, and a depth layer, where the context prior layer and semantic prior layer contain CPA and SPA, respectively. Note that only DepthNet is used at inference.
  • Figure 3: Structure of the hybrid transformer layer. Multi-scale patch embedding provides receptive fields of different scales. The transformer and convolution branches then capture global spatial features and local detail features, respectively. Subsequently, the global-local feature interaction layer fuses these features.
  • Figure 4: The structure of the proposed context prior attention. CPA uses the spatial branch (cyan) to extract context information in the spatial dimension, while the channel branch (gray) supplements it with relationships in the channel dimension. $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ denote the query, key, and value.
  • Figure 5: Visualization of the attention mechanism. We show feature visualizations for CPA, criss-cross attention (CCNet), and the case without attention. The criss-cross attention focuses excessively on close objects. In contrast, our CPA focuses on nearby scenes while highlighting distant objects such as vehicles and trees.
  • ...and 7 more figures
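
The Figure 4 caption describes an attention block with a spatial branch built on query, key, and value, plus a channel branch that models relationships between channels. The sketch below illustrates that general two-branch idea in NumPy. It is not the paper's CPA: the function name `dual_branch_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the scaling factors, and the residual-sum fusion are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def dual_branch_attention(feat, w_q, w_k, w_v):
    """Generic spatial + channel attention over a (C, H, W) feature map.

    Hypothetical parameters (not taken from the paper):
      w_q, w_k: (D, C) projections for query and key
      w_v:      (C, C) projection for value
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)                      # (C, N), N = H * W

    # Spatial branch: every position attends to every other position.
    q = w_q @ x                                     # (D, N)
    k = w_k @ x                                     # (D, N)
    v = w_v @ x                                     # (C, N)
    attn_s = softmax(q.T @ k / np.sqrt(q.shape[0]), axis=-1)   # (N, N)
    spatial = v @ attn_s.T                          # (C, N)

    # Channel branch: every channel attends to every other channel.
    attn_c = softmax(x @ x.T / np.sqrt(x.shape[1]), axis=-1)   # (C, C)
    channel = attn_c @ x                            # (C, N)

    # Assumed fusion: residual sum of the input and both branches.
    return (x + spatial + channel).reshape(c, h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((8, 6, 10))
    w_q = rng.standard_normal((4, 8))
    w_k = rng.standard_normal((4, 8))
    w_v = rng.standard_normal((8, 8))
    out = dual_branch_attention(feat, w_q, w_k, w_v)
    print(out.shape)
```

The spatial attention matrix has one row per position, so its cost grows quadratically with H × W; this is why blocks of this kind are usually applied to low-resolution decoder features.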