MambaDepth: Enhancing Long-range Dependency for Self-Supervised Fine-Structured Monocular Depth Estimation

Ionuţ Grigore; Călin-Adrian Popa

TL;DR

MambaDepth addresses the difficulty of modeling long-range dependencies in self-supervised monocular depth estimation by building a U-Net-style encoder-decoder with skip connections entirely from Mamba-based blocks. The MD block, the SS2D scanning module, and an encoder-decoder integration strategy give linear scaling and efficient capture of global context, leading to state-of-the-art performance on KITTI and strong zero-shot generalization to Make3D and Cityscapes. Experiments and ablations also show that ImageNet pretraining further improves accuracy, which underlines the value of strong initial representations when learning depth without ground-truth supervision. Overall, MambaDepth points toward efficient, SSM-based backbones with long-range modeling for high-quality self-supervised depth estimation in real-world scenes.

Abstract

In the field of self-supervised depth estimation, Convolutional Neural Networks (CNNs) and Transformers have traditionally been dominant. However, both architectures struggle to handle long-range dependencies efficiently, CNNs because of their local focus and Transformers because of their computational demands. To overcome this limitation, we present MambaDepth, a versatile network tailored for self-supervised depth estimation. It draws inspiration from the Mamba architecture, which is known for handling long sequences well and for capturing global context efficiently through a State Space Model (SSM), and combines the U-Net's effectiveness in self-supervised depth estimation with these capabilities. MambaDepth is structured around a purely Mamba-based encoder-decoder framework, with skip connections that preserve spatial information at various levels of the network. This configuration promotes extensive feature learning, enabling the network to capture both fine details and broader context within depth maps. Furthermore, we have developed a novel integration technique within the Mamba blocks that facilitates uninterrupted connectivity and information flow between the encoder and decoder components, thereby improving depth accuracy. Comprehensive testing on the established KITTI dataset demonstrates MambaDepth's superiority over leading CNN- and Transformer-based models in the self-supervised depth estimation task, where it achieves state-of-the-art performance. Moreover, MambaDepth shows superior generalization on other datasets such as Make3D and Cityscapes. MambaDepth's performance heralds a new era in effective long-range dependency modeling for self-supervised depth estimation.

Paper Structure (18 sections, 13 equations, 9 figures, 5 tables)

This paper contains 18 sections, 13 equations, 9 figures, 5 tables.

Introduction
Related work
Supervised Depth Estimation
Self-supervised Depth Estimation
State Space Models
Method
Self-supervised framework
MambaDepth
MD block
Loss function
Experiments
Datasets and Experimental Protocol
Implementation Details
KITTI Results
Cityscapes Results
...and 3 more sections

Figures (9)

Figure 1: Typical predictions of our method on images from the KITTI dataset outperform those of the classical Monodepth2 key-17 and of recent attempts to use Transformers key-73 or a self-attention mechanism key-74 in self-supervised monocular depth estimation. Notably, our approach excels at recovering intricate scene details.
Figure 2: Overview of our self-supervised framework. Our proposed MambaDepth adopts a U-Net architecture, using MambaDepth blocks in the encoder to obtain low-resolution feature maps of the current frame $I_{t}$. These low-resolution feature maps then pass through successive MambaDepth blocks in the decoder, together with skip connections, and a final Sigmoid layer produces the disparities. The predicted disparities at the various scales are then upsampled to match the original input resolution. Additionally, a standard pose network takes the temporally adjacent frames $I_{t}$ and $I_{t-1}$ as input and outputs the relative pose $T_{t-1\rightarrow t}$. The camera pose is required only during training, for differentiable warping. In line with numerous prior studies, we use pixels from frame $I_{t-1}$ to reconstruct frame $I_{t}$ from the depth map $D_{t}$ and the relative pose $T_{t-1\rightarrow t}$ through a differentiable warping process key-30. The loss function is based on the differences between the warped image $I_{t-1\rightarrow t}$ and the target image $I_{t}$.
Figure 3: Overview of the MambaDepth architecture. The MambaDepth structure includes an encoder, a bottleneck, a decoder, and skip connections. Each of these components (the encoder, the bottleneck, and the decoder) is built from MD blocks.
Figure 4: The scan expanding and scan merging operations in SS2D. In the SS2D method, input patches follow four distinct scanning paths. Each sequence is then independently processed by separate S6 blocks. Finally, the results are combined to create a 2D feature map, which serves as the final output.
Figure 5: The detailed structure of the MD (MambaDepth) Block.
...and 4 more figures
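
The scan expanding and scan merging steps described in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the choice of the four paths (row-major, column-major, and their reverses), the summation used for merging, and the `toy_sequence_model` stand-in for the S6 block are all assumptions made for illustration.

```python
import numpy as np

def scan_expand(x):
    """Unfold an (H, W, C) feature map into four (H*W, C) sequences.

    Assumed paths: row-major, column-major, and the reverse of each.
    """
    h, w, c = x.shape
    row_major = x.reshape(h * w, c)                     # left-to-right, then down
    col_major = x.transpose(1, 0, 2).reshape(h * w, c)  # top-to-bottom, then right
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])

def scan_merge(seqs, h, w):
    """Undo each path's ordering and combine the four maps into (H, W, C).

    Combining by summation is an assumption; the caption only says "combined".
    """
    c = seqs.shape[-1]
    row_f = seqs[0].reshape(h, w, c)
    col_f = seqs[1].reshape(w, h, c).transpose(1, 0, 2)
    row_b = seqs[2][::-1].reshape(h, w, c)
    col_b = seqs[3][::-1].reshape(w, h, c).transpose(1, 0, 2)
    return row_f + col_f + row_b + col_b

def toy_sequence_model(seq, decay=0.5):
    """Hypothetical stand-in for an S6 block: h_t = decay * h_{t-1} + x_t.

    A real S6 block uses learned, input-dependent parameters; this fixed
    causal recurrence only shows where the per-path processing happens.
    """
    out = np.empty_like(seq, dtype=float)
    state = np.zeros(seq.shape[-1], dtype=float)
    for t in range(seq.shape[0]):
        state = decay * state + seq[t]
        out[t] = state
    return out

def ss2d_sketch(x):
    """Expand, process each path independently, then merge back to 2D."""
    h, w, _ = x.shape
    seqs = scan_expand(x)
    processed = np.stack([toy_sequence_model(s) for s in seqs])
    return scan_merge(processed, h, w)

# Usage: a 2x3 feature map with 2 channels keeps its shape.
features = np.arange(12, dtype=float).reshape(2, 3, 2)
output = ss2d_sketch(features)
```

Each path is processed by a causal, one-directional model, so a single path only sees positions that come earlier in its own ordering. Running a forward and a backward path means every output location can receive information from every input location, while the cost stays linear in the number of patches.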
