RobustMVS: Single Domain Generalized Deep Multi-view Stereo
Hongbin Xu, Weitao Chen, Baigui Sun, Xuansong Xie, Wenxiong Kang
TL;DR
RobustMVS addresses domain generalization in deep MVS under a single-source regime by enforcing local cross-view feature invariance. It introduces Depth-Clustering-guided Whitening (DCW), which clusters regions by depth and applies whitening to locally corresponding features aligned across views by homography warping. The DCW term is added to the depth loss to form the overall objective: $L = L_{\text{depth}} + \frac{\lambda}{(N-1)LK} \sum_{n=1}^{N-1} \sum_{l=1}^{L} \sum_{k=1}^{K_{\text{clu}}} L_{\text{DCW}}^{l,k}$. The approach relies on Instance Normalization and minimal modifications to the CasMVSNet backbone (and related backbones) to improve domain robustness, yielding superior generalization across DTU, BlendedMVS, GTASFM, Tanks&Temples, and PASMVS while maintaining competitive Tanks&Temples performance. The results highlight the importance of locality in feature whitening for MVS and point toward future integration with Transformer-based architectures for stronger domain-generalized 3D reconstruction.
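To make the structure of the overall loss above concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that the features of one depth cluster have already been warped and gathered into a `(C, M)` matrix, that the per-cluster whitening term penalizes off-diagonal entries of the covariance of instance-normalized features, and that the DCW terms are indexed by source view, stage, and cluster. The function names and the exact form of `whitening_loss` are hypothetical.

```python
import numpy as np

def instance_norm(feat, eps=1e-5):
    """Normalize each channel of a (C, M) feature matrix to zero mean, unit variance."""
    mean = feat.mean(axis=1, keepdims=True)
    std = feat.std(axis=1, keepdims=True)
    return (feat - mean) / (std + eps)

def whitening_loss(feat):
    """Hypothetical stand-in for one L_DCW term (assumption, not the paper's definition).

    feat: (C, M) array of C >= 2 channels over the M pixels of one depth cluster.
    Returns the mean absolute off-diagonal entry of the channel covariance.
    """
    c, m = feat.shape
    normed = instance_norm(feat)
    cov = normed @ normed.T / m                # (C, C) channel covariance
    off_diag = cov - np.diag(np.diag(cov))     # zero out the diagonal
    return np.abs(off_diag).sum() / (c * (c - 1))

def total_loss(depth_loss, dcw_terms, lam):
    """Combine the depth loss with the averaged DCW terms.

    dcw_terms: array of shape (N-1, L, K), assumed to hold one value per
    source view, stage, and cluster. Dividing the triple sum by (N-1)*L*K
    matches the normalizer in the formula, i.e. lam times the mean term.
    """
    dcw_terms = np.asarray(dcw_terms, dtype=float)
    n_src, n_stage, n_clu = dcw_terms.shape
    return depth_loss + lam * dcw_terms.sum() / (n_src * n_stage * n_clu)
```

Under these assumptions, the second term is simply `lam` times the mean of all per-cluster whitening losses, so its scale does not grow with the number of views, stages, or clusters.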
Abstract
Despite the impressive performance of Multi-view Stereo (MVS) approaches given plenty of training samples, the performance degradation when generalizing to unseen domains has not been clearly explored yet. In this work, we focus on the domain generalization problem in MVS. To evaluate the generalization results, we build a novel MVS domain generalization benchmark including synthetic and real-world datasets. In contrast to conventional domain generalization benchmarks, we consider a more realistic but challenging scenario, where only one source domain is available for training. The MVS problem can be viewed as a feature matching task, and maintaining robust feature consistency among views is an important factor for improving generalization performance. To address the domain generalization problem in MVS, we propose a novel MVS framework, namely RobustMVS. A Depth-Clustering-guided Whitening (DCW) loss is further introduced to preserve the feature consistency among different views; it decorrelates multi-view features from viewpoint-specific style information based on geometric priors from depth maps. Experimental results show that our method achieves superior performance on the domain generalization benchmark.
