RobustMVS: Single Domain Generalized Deep Multi-view Stereo

Hongbin Xu; Weitao Chen; Baigui Sun; Xuansong Xie; Wenxiong Kang

TL;DR

RobustMVS addresses domain generalization in deep MVS under a single-source regime by enforcing local cross-view feature invariance. It introduces Depth-Clustering-guided Whitening (DCW), which clusters depth-informed regions and applies whitening on locally corresponding features using homography warping, integrated into an overall loss: $L = L_{\text{depth}} + \frac{\lambda}{(N-1)LK} \sum_{n=1}^{N-1} \sum_{l=1}^{L} \sum_{k=1}^{K_{\text{clu}}} L_{\text{DCW}}^{l,k}$. The approach relies on Instance Normalization and minimal backbone modifications to CasMVSNet (and related backbones) to improve domain robustness, yielding superior generalization across DTU, BlendedMVS, GTASFM, Tanks&Temples, and PASMVS while maintaining competitive TT performance. The results highlight the importance of locality in feature whitening for MVS and point toward future integration with Transformer-based architectures for even stronger domain-generalized 3D reconstruction.
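As a rough illustration of how the overall loss combines its terms, the following minimal sketch averages a set of DCW terms and adds them to the depth loss. The function name `overall_loss`, the nested-list layout `dcw_losses[n][l][k]`, and the reading of the normalizer $K$ as the cluster count $K_{\text{clu}}$ are assumptions made for illustration; they are not taken from the paper's code.

```python
def overall_loss(depth_loss, dcw_losses, lam):
    """Combine a depth loss with the averaged DCW terms.

    dcw_losses[n][l][k] is a hypothetical nested list holding the DCW loss for
    source view n (N - 1 views), feature level l (L levels), and depth cluster
    k (K_clu clusters). lam is the weighting factor lambda.
    """
    num_views = len(dcw_losses)           # N - 1
    num_levels = len(dcw_losses[0])       # L
    num_clusters = len(dcw_losses[0][0])  # assumed to equal K in the normalizer

    # Triple sum over views, levels, and clusters.
    total = sum(
        value
        for per_view in dcw_losses
        for per_level in per_view
        for value in per_level
    )
    return depth_loss + lam * total / (num_views * num_levels * num_clusters)
```

For example, with one source view, two levels, and two clusters, `overall_loss(1.0, [[[0.2, 0.4], [0.6, 0.8]]], 0.5)` adds half of the mean DCW term (0.5) to the depth loss.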

Abstract

Despite the impressive performance of Multi-view Stereo (MVS) approaches when plenty of training samples are available, their performance degradation on unseen domains has not yet been clearly explored. In this work, we focus on the domain generalization problem in MVS. To evaluate generalization, we build a novel MVS domain generalization benchmark that includes synthetic and real-world datasets. In contrast to conventional domain generalization benchmarks, we consider a more realistic but more challenging scenario in which only one source domain is available for training. The MVS problem can be viewed as a feature matching task, so maintaining robust feature consistency among views is an important factor in improving generalization performance. To address the domain generalization problem in MVS, we propose a novel MVS framework, namely RobustMVS. A Depth-Clustering-guided Whitening (DCW) loss is further introduced to preserve feature consistency among different views; it decorrelates multi-view features from viewpoint-specific style information based on geometric priors from depth maps. The experimental results show that our method achieves superior performance on the domain generalization benchmark.
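The idea of decorrelating features within depth-defined regions can be sketched in a few lines. The code below is a generic covariance-decorrelation penalty applied per depth bin, not the paper's DCW formulation: equal-width depth binning stands in for the clustering step, homography warping across views is omitted, and all function names are hypothetical.

```python
def depth_clusters(depths, num_clusters):
    """Assign each pixel to an equal-width depth bin (stand-in for clustering)."""
    lo, hi = min(depths), max(depths)
    width = (hi - lo) / num_clusters or 1.0  # avoid a zero bin width
    return [min(int((d - lo) / width), num_clusters - 1) for d in depths]


def covariance(features):
    """Channel covariance matrix of a list of per-pixel feature vectors."""
    m, c = len(features), len(features[0])
    mean = [sum(f[i] for f in features) / m for i in range(c)]
    return [
        [sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in features) / m
         for j in range(c)]
        for i in range(c)
    ]


def whitening_penalty(features):
    """Mean absolute off-diagonal covariance; zero if channels are decorrelated."""
    cov = covariance(features)
    c = len(cov)
    off = [abs(cov[i][j]) for i in range(c) for j in range(c) if i != j]
    return sum(off) / len(off) if off else 0.0


def clustered_whitening_penalty(features, depths, num_clusters):
    """Average the decorrelation penalty over depth bins with at least 2 pixels."""
    labels = depth_clusters(depths, num_clusters)
    penalties = []
    for k in range(num_clusters):
        members = [f for f, lab in zip(features, labels) if lab == k]
        if len(members) > 1:
            penalties.append(whitening_penalty(members))
    return sum(penalties) / len(penalties) if penalties else 0.0
```

Computing the penalty per depth bin rather than over the whole feature map is what makes the statistic local: pixels at similar depth are grouped before their channel correlations are measured.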

Paper Structure (28 sections, 26 equations, 7 figures, 7 tables)

Introduction
Related Work
Learning-based MVS
Domain Generalization in Stereo Vision
Preliminary
Whitening Transformation and Whitening Loss
Limitations of Whitening Loss in MVS
Method
Problem Statement and Notation
Backbone
Convolutional Block with Normalization
Depth-Clustering-guided Whitening Loss
Clustering with Depth Priors
Homography Whitening
Overall loss
...and 13 more sections

Figures (7)

Figure 1: Illustration of the proposed MVS domain generalization task. A single source domain is selected for training, and the model is then tested on target domains without fine-tuning. The MMD distances among datasets are visualized on the right. MVS datasets: DTU (DT) [jensen2014large], BlendedMVS (BL) [yao2020blendedmvs], GTASFM (GS) [wang2020flow], PASMVS (PA) [broekman2020pasmvs], Tanks&Temples (TT) [knapitsch2017tanks].
Figure 2: Visualization of the activation maps of CasMVSNet [gu2020cascade] via Grad-CAM [selvaraju2017grad]. GS$\rightarrow$DT means training on domain GS and generalizing to the unseen domain DT. DT$\rightarrow$DT means training and testing on the same domain DT.
Figure 3: Illustration of our proposed RobustMVS framework. Aug. means random data augmentation. The blue dashed box represents the clustering process with the depth prior. This process extracts locally corresponding regions by clustering, and the clusters are then fed to the proposed Depth-Clustering-guided Whitening loss shown in the purple dashed box. The orange dashed box at the bottom represents the basic architecture of the adopted MVS network, in which the first 3 BN convolution blocks are replaced with IN-based convolution blocks.
Figure 4: Some examples of the processed ground truths of BL/GS/PA.
Figure 5: Qualitative comparisons with state-of-the-art MVS methods using DT as the source domain.
...and 2 more figures
