MVSAnywhere: Zero-Shot Multi-View Stereo

Sergio Izquierdo, Mohamed Sayed, Michael Firman, Guillermo Garcia-Hernando, Daniyar Turmukhambetov, Javier Civera, Oisin Mac Aodha, Gabriel Brostow, Jamie Watson

TL;DR

MVSA tackles generalizable depth estimation from multi-view inputs across diverse domains and depth ranges. It introduces a transformer-based architecture that fuses multi-view cost volumes with monocular cues via a Cost Volume Patchifier and a Mono/Multi Cue Combiner, plus a cascaded depth-range strategy and view-count-agnostic metadata. The approach achieves state-of-the-art zero-shot depth on the Robust Multi-View Depth Benchmark and yields metric-scale depths that produce high-quality 3D reconstructions, outperforming both monocular and prior MVS baselines. This work enables robust 3D understanding in uncontrolled, real-world scenarios and provides code and pretrained models for reproducibility.

Abstract

Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g. indoor vs. outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions: how best to make use of transformer-based architectures, how to incorporate additional metadata when the number of input views varies, and how to estimate the range of valid depths, which can vary considerably across scenes and is typically not known a priori. To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to deal with scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.

Paper Structure

This paper contains 49 sections, 6 equations, 17 figures, 12 tables.

Figures (17)

  • Figure 1: Our MVSA model produces high-quality reconstructions from posed images and is superior to existing monocular and MVS methods. Here we compare with Depth Pro [bochkovskii2024depthpro], a recent monocular method that produces sharp, good-looking depth maps but can scale depths inconsistently; consistent scaling is required for good meshes. We also include a variant of MAST3R [mast3r_arxiv24] that we have augmented with ground truth camera poses. Our model gives sharp depth maps which are also accurate and 3D consistent, producing high-quality meshes in zero-shot environments.
  • Figure 2: MVS datasets cover a wide range of depth values. Here we show the distribution of depths in the DTU [jensen2014large], ScanNet [dai2017scannet], ETH3D [schoeps2017cvpr], Tanks and Temples [Knapitsch2017], and KITTI [Geiger2012CVPR] datasets, as a stacked bar chart. Note the logarithmic x-axis. This wide range of depth values can be challenging when it comes to constructing meaningful cost volumes and predicting the final depths.
  • Figure 3: Our general-purpose multi-view depth estimation model. We start with a cost-volume based architecture, which matches deep features between views at different hypothesized depths. Key for performance are our Cost Volume Patchifier and Mono/Multi Cue Combiner. These also fuse single-view information coming from the Reference Image Encoder and source views.
  • Figure 4: Our cost volume patchifier enables high-quality information to be extracted from a $|\mathcal{D}| \times \frac{H}{4} \times \frac{W}{4}$ cost volume, ready for input to the Mono/Multi Cue Combiner ViT. (a) Shows the naive approach to patchification. (b) Our approach makes better use of the reference image features.
  • Figure 5: Many MVS models fail in areas of poor frame overlap. Here we show how MVSFormer++ (right) fails to recover geometry in areas of the image where there are no matching pixels between source and target views (see the top left corner). Our model (middle) handles this situation gracefully.
  • ...and 12 more figures
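
The caption of Figure 2 notes that depths span a wide range (plotted on a log axis) and that this makes it hard to construct meaningful cost volumes. As a rough illustration only, the sketch below compares two common ways of choosing the depth hypotheses for a plane-sweep cost volume: uniform spacing in depth and uniform spacing in log-depth. The function names, the depth range of 0.5 m to 80 m, and the bin count of 64 are all hypothetical choices made for this example; they are not taken from the paper, and this is not a description of MVSA's own depth-range strategy.

```python
import math
from typing import List


def linear_spaced_depths(d_min: float, d_max: float, num_bins: int) -> List[float]:
    """Depth hypotheses spaced uniformly in depth (hypothetical helper)."""
    if not (0.0 < d_min < d_max):
        raise ValueError("expected 0 < d_min < d_max")
    if num_bins < 2:
        raise ValueError("need at least two depth bins")
    step = (d_max - d_min) / (num_bins - 1)
    return [d_min + i * step for i in range(num_bins)]


def log_spaced_depths(d_min: float, d_max: float, num_bins: int) -> List[float]:
    """Depth hypotheses spaced uniformly in log-depth (hypothetical helper).

    Consecutive hypotheses differ by a constant ratio, so near and far
    parts of the range receive the same relative resolution.
    """
    if not (0.0 < d_min < d_max):
        raise ValueError("expected 0 < d_min < d_max")
    if num_bins < 2:
        raise ValueError("need at least two depth bins")
    log_min, log_max = math.log(d_min), math.log(d_max)
    step = (log_max - log_min) / (num_bins - 1)
    return [math.exp(log_min + i * step) for i in range(num_bins)]


if __name__ == "__main__":
    # Assumed values for illustration: an outdoor-like range and 64 bins.
    lin = linear_spaced_depths(0.5, 80.0, 64)
    log = log_spaced_depths(0.5, 80.0, 64)
    near_lin = sum(1 for d in lin if d < 5.0)
    near_log = sum(1 for d in log if d < 5.0)
    print(f"hypotheses closer than 5 m: linear={near_lin}, log={near_log}")
```

With a fixed number of bins, log spacing places many more hypotheses at close range than linear spacing does, at the cost of coarser sampling far away. Either way the range endpoints still have to be known or estimated, which is the difficulty the abstract raises.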