SDL-MVS: View Space and Depth Deformable Learning Paradigm for Multi-View Stereo Reconstruction in Remote Sensing
Yong-Qiang Mao, Hanbo Bi, Liangyu Xu, Kaiqiang Chen, Zhirui Wang, Xian Sun, Kun Fu
TL;DR
This paper tackles depth estimation for large-scale remote sensing multi-view stereo, where occlusion and uneven brightness across views blur depth details. It introduces SDL-MVS, a view-space and depth deformable learning paradigm with two components. Progressive Space Deformable Sampling (PSS) progressively and deformably samples features across the 3D frustum space and the 2D image space, and Depth Hypothesis Deformable Discretization (DHD) refines the depth prior by adapting the depth range and deformably discretizing the depth intervals. The method reports state-of-the-art results on the LuoJia-MVS and WHU datasets for both 3-view and 5-view inputs. With 3 views on LuoJia-MVS it reaches an MAE of 0.086 m and 98.9% accuracy on both the <0.6 m and <3-interval metrics. The gains appear in both quantitative metrics and qualitative reconstructions, and they indicate robustness to occlusion and illumination variation, which matters for large-scale urban 3D mapping and other remote sensing applications.
Abstract
Research on multi-view stereo based on remote sensing images has promoted the development of large-scale urban 3D reconstruction. However, remote sensing multi-view image data suffers from occlusion and uneven brightness between views during acquisition, which blurs details in depth estimation. To solve this problem, we re-examine the deformable learning method in the Multi-View Stereo task and propose a novel paradigm based on view Space and Depth deformable Learning (SDL-MVS), which learns deformable interactions of features in different view spaces and deformably models the depth ranges and intervals to enable highly accurate depth estimation. Specifically, to address the view noise caused by occlusion and uneven brightness, we propose a Progressive Space deformable Sampling (PSS) mechanism, which performs deformable learning of sampling points in the 3D frustum space and the 2D image space in a progressive manner to adaptively embed source features into the reference feature. To further optimize the depth, we introduce Depth Hypothesis deformable Discretization (DHD), which precisely positions the depth prior by adaptively adjusting the depth range hypothesis and deformably discretizing the depth interval hypothesis. In this way, SDL-MVS explicitly models the occlusion and uneven brightness encountered in multi-view stereo through the deformable learning paradigm of view space and depth, achieving accurate multi-view depth estimation. Extensive experiments on the LuoJia-MVS and WHU datasets show that SDL-MVS reaches state-of-the-art performance. Notably, with three views as input, SDL-MVS achieves an MAE of 0.086, an accuracy of 98.9% for <0.6m, and 98.9% for <3-interval on the LuoJia-MVS dataset.
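For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how they are commonly computed in aerial multi-view stereo benchmarks. It assumes MAE is the mean absolute depth error over valid pixels, that "<0.6m" is the fraction of valid pixels whose absolute error is below 0.6 m, and that "<3-interval" is the fraction whose absolute error is below three depth sampling intervals. The function name `depth_metrics`, the `interval` parameter, and the strict `<` comparisons are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, Optional, Sequence


def depth_metrics(
    pred: Sequence[float],
    gt: Sequence[float],
    interval: float,
    valid: Optional[Sequence[bool]] = None,
) -> Dict[str, float]:
    """Compute MAE, <0.6m accuracy, and <3-interval accuracy.

    pred, gt : flattened predicted and ground-truth depths (same units, metres).
    interval : size of one depth sampling interval (hypothetical parameter).
    valid    : optional mask; pixels marked False are ignored.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    if valid is None:
        valid = [True] * len(pred)
    if len(valid) != len(pred):
        raise ValueError("valid mask must match pred in length")

    # Absolute depth error at every valid pixel.
    errors = [abs(p - g) for p, g, v in zip(pred, gt, valid) if v]
    if not errors:
        raise ValueError("no valid pixels to evaluate")

    n = len(errors)
    return {
        "mae": sum(errors) / n,
        # Fraction of valid pixels with error below the fixed 0.6 m threshold.
        "acc_0.6m": sum(e < 0.6 for e in errors) / n,
        # Fraction of valid pixels with error below three depth intervals.
        "acc_3interval": sum(e < 3 * interval for e in errors) / n,
    }


if __name__ == "__main__":
    # Toy example with four pixels; the values are made up for illustration.
    print(depth_metrics([10.0, 10.5, 12.0, 9.0], [10.0] * 4, interval=0.4))
```

Under these assumptions, the two accuracy figures differ only in their threshold: one is fixed in metres, while the other scales with the depth interval used for the scene.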
