A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding

Yitong Dong, Yijin Li, Zhaoyang Huang, Weikang Bian, Jingbo Liu, Hujun Bao, Zhaopeng Cui, Hongsheng Li, Guofeng Zhang

TL;DR

This work addresses the sensitivity of multi-view stereo to depth-range priors by proposing a depth-range-free framework that reasons jointly over all source views. It uses a Multi-view Disparity Attention (MDA) module with 3D pose embedding, together with per-view disparity hidden states and uncertainty estimation, to fuse multi-view information through iterative GRU-based refinement of the epipolar disparity flow. The approach achieves state-of-the-art performance among depth-range-free methods on DTU and Tanks and Temples, and is more robust to depth-range variations and occlusions. This makes 3D reconstruction more reliable in real-world scenarios where a precise depth range is not available.

Abstract

In this paper, we propose a novel multi-view stereo (MVS) framework that removes the need for a depth-range prior. Unlike recent prior-free MVS methods that work in a pair-wise manner, our method considers all the source images simultaneously. Specifically, we introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information within and across multi-view images. Given the asymmetry of the epipolar disparity flow, the key to our method lies in accurately modeling multi-view geometric constraints. We integrate a pose embedding that encapsulates information such as the multi-view camera poses, providing implicit geometric constraints for the attention-dominated fusion of multi-view disparity features. Additionally, because the observation quality of the same reference-frame pixel differs significantly across source frames, we construct a separate hidden state for each source image. We explicitly estimate the quality of the points sampled on the epipolar line of each source image for the current pixel, and dynamically update the hidden states through the uncertainty estimation module. Extensive results on the DTU dataset and the Tanks and Temples benchmark demonstrate the effectiveness of our method. The code is available at our project page: https://zju3dv.github.io/GD-PoseMVS/.
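
To make the term "epipolar disparity flow" concrete, the following is a minimal geometric sketch, not the authors' implementation. It assumes a standard pinhole camera model and illustrates only the underlying relation: a reference pixel with a depth hypothesis projects to a point in a source image, and changing the depth moves that point along the epipolar line. The function names, the intrinsics `K`, and the pose convention `X_src = R @ X_ref + t` are assumptions chosen for illustration.

```python
import numpy as np

def project_to_source(pixel, depth, K_ref, K_src, R, t):
    """Project a reference pixel with a depth hypothesis into a source image.

    Assumed convention (hypothetical): R, t map reference-camera coordinates
    to source-camera coordinates, X_src = R @ X_ref + t.
    """
    u, v = pixel
    # Back-project the pixel to a ray with unit z in the reference camera.
    ray = np.linalg.inv(K_ref) @ np.array([u, v, 1.0])
    X_ref = depth * ray              # 3D point in reference-camera coordinates
    X_src = R @ X_ref + t            # same point in source-camera coordinates
    p = K_src @ X_src                # homogeneous source-image coordinates
    return p[:2] / p[2]

def epipolar_flow(pixel, depth, K_ref, K_src, R, t):
    """2D displacement from the reference pixel to its source-image projection."""
    target = project_to_source(pixel, depth, K_ref, K_src, R, t)
    return target - np.asarray(pixel, dtype=float)

if __name__ == "__main__":
    # Toy setup: identical intrinsics, pure sideways translation of the camera.
    K = np.array([[100.0, 0.0, 50.0],
                  [0.0, 100.0, 50.0],
                  [0.0, 0.0, 1.0]])
    R = np.eye(3)
    t = np.array([-0.1, 0.0, 0.0])
    for depth in (2.0, 4.0):
        print(depth, epipolar_flow((50.0, 50.0), depth, K, K, R, t))
```

In this toy setup the flow is purely horizontal and shrinks as the depth hypothesis grows, so a position along the epipolar line determines the depth. That is why a method can update the flow in image space instead of sampling depths from a predefined range.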

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Robustness test with respect to the depth range. Under identical training configurations, our method is more robust to variations in depth range than two state-of-the-art methods, IterMVS and DispMVS. The red markings denote the actual depth range used during training.
  • Figure 2: Overview of our method. We introduce the disparity feature encoding module to encode viewpoint quality differences, and the Multi-view Disparity Attention (MDA) module to facilitate information interaction between multi-view images. The MDA module is depicted in Fig. 3. Starting from an initial depth map $D_0$, the epipolar disparity flows are iteratively updated and fused to the depth of the next stage.
  • Figure 3: Illustration of MDA module. After concatenating features with 3D pose embedding and 2D normalized positional encoding, we achieve intra-image and inter-image information interaction through self-attention and cross-attention. As shown in the right figure, 3D pose embedding encodes relative pose and pixel geometric information into the features to enhance the learning capability of the attention mechanism.
  • Figure 4: Some qualitative results of the proposed method on DTU and Tanks and Temples datasets.
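
The data flow described in the Figure 3 caption (concatenate features with a pose embedding and a 2D normalized positional encoding, then apply self-attention within each image and cross-attention across images) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's MDA module: it uses single-head attention without learned projections, random arrays stand in for the disparity features and the 3D pose embedding, and all names and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned weights here)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def normalized_grid(height, width):
    """2D pixel coordinates normalized to [0, 1], one (x, y) row per pixel."""
    ys, xs = np.meshgrid(np.linspace(0.0, 1.0, height),
                         np.linspace(0.0, 1.0, width), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

def build_tokens(features, pose_embedding, positions):
    """Concatenate per-pixel features, pose embedding, and positions."""
    return np.concatenate([features, pose_embedding, positions], axis=-1)

rng = np.random.default_rng(0)
H, W, C, P = 4, 5, 8, 6            # hypothetical sizes
positions = normalized_grid(H, W)

# One token set per source view; random placeholders for real inputs.
views = [build_tokens(rng.normal(size=(H * W, C)),
                      rng.normal(size=(H * W, P)),
                      positions)
         for _ in range(3)]

# Intra-image interaction: each view attends to its own tokens.
self_out = [attention(v, v, v)[0] for v in views]

# Inter-image interaction: view 0 queries the tokens of the other views.
others = np.concatenate(self_out[1:], axis=0)
cross_out, cross_weights = attention(self_out[0], others, others)
print(cross_out.shape, cross_weights.shape)
```

A real module would add learned query, key, and value projections and multiple heads; the sketch only shows where the pose and position information enters and how tokens from different views are mixed.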