$D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

Ruizhi Wang; Weihan Li; Zunlei Feng; Haofei Zhang; Mingli Song; Jiayu Wang; Jie Song; Li Sun

Abstract

Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Vision Transformer (ViT) backbones for dense prediction are fast, but their outputs often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first uses a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior and replaces the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates in a compact latent space provided by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves an 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) metric relative to leading models such as Marigold, while also delivering a more than 40$\times$ inference speedup and keeping VRAM usage comparable to that of lightweight ViT models.

Paper Structure (25 sections, 4 equations, 4 figures, 2 tables)

Introduction
Related Works
ViT-based Monocular Depth Estimation
Diffusion-based Monocular Depth Estimation
VAE for Diffusion Models
Method
Overview
Preliminary Scene Structuring
Progressive Detail Refinement
Efficient Diffusion backbone.
Progressive Linear Blending Refinement.
Experiment
Experimental Settings
Benchmark and Metrics.
Implementation Details.
...and 10 more sections

Figures (4)

Figure 1: The difference between $D^3$-RSMDE and Marigold. Whereas Marigold reconstructs depth through many denoising steps, $D^3$-RSMDE first uses an efficient ViT to regress a coarse depth map and then obtains a fine-grained, high-fidelity depth map with fewer denoising steps.
Figure 2: The framework of $D^3$-RSMDE. During training, the ViT first regresses a coarse depth map $d_c$ from the input remote sensing image $x$; then $d_c$, the ground truth $d_0$, and $x$ are combined through PLBR to produce the training samples for the Refiner Diffusion. During inference, $d_0$ is replaced by the output of each Refiner Diffusion step, yielding a refined, high-fidelity remote sensing depth map.
Figure 3: Comparison of model efficiency.
Figure 4: Comparison of our $D^3$-RSMDE and some SOTA methods in different categories of remote sensing images.
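
The Figure 2 caption describes PLBR only in words, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the blended sample at step $t$ is the linear interpolation $d_t = (1 - t/T)\,d_0 + (t/T)\,d_c$, and that at inference the refiner's estimate takes the place of $d_0$ at every step. The names `blend`, `refine`, and `predict_d0` are hypothetical, depth maps are represented as flat lists of floats for simplicity, and the conditioning on the image $x$ and the VAE latent space are omitted.

```python
from typing import Callable, List

Depth = List[float]  # toy stand-in for a flattened depth map (assumption)


def blend(d0: Depth, dc: Depth, t: int, T: int) -> Depth:
    """Linearly interpolate between the target d0 (t = 0) and the coarse map dc (t = T)."""
    w = t / T
    return [(1.0 - w) * a + w * b for a, b in zip(d0, dc)]


def refine(dc: Depth, predict_d0: Callable[[Depth, int], Depth], T: int) -> Depth:
    """Start from the coarse map and walk back to step 0 in T refinement steps."""
    d_t = list(dc)  # step T: the sample equals the coarse ViT output
    for t in range(T, 0, -1):
        d0_hat = predict_d0(d_t, t)        # hypothetical refiner (a U-Net in the paper)
        d_t = blend(d0_hat, dc, t - 1, T)  # the estimate replaces the ground truth d0
    return d_t


if __name__ == "__main__":
    truth = [0.0, 0.5, 1.0]   # made-up values for illustration only
    coarse = [0.2, 0.4, 0.7]

    def oracle(d_t: Depth, t: int) -> Depth:
        # An idealized refiner that always returns the true depth.
        return truth

    print(refine(coarse, oracle, T=4))
```

With an idealized refiner the loop lands exactly on the target after the last step, because the blending weight on $d_c$ reaches zero at $t = 0$. A trained refiner would only approximate this, and the actual blending schedule may differ from the linear one assumed here.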
