VHS: High-Resolution Iterative Stereo Matching with Visual Hull Priors

Markus Plack, Hannah Dröge, Leif Van Holland, Matthias B. Hullin

TL;DR

VHS tackles high-resolution stereo depth estimation by integrating visual hull priors derived from auxiliary views into a sparse-dense, memory-efficient RAFT-like pipeline. It constrains the disparity search with hull-derived bounds and uses a ConvGRU-based iterative refinement with locally computed correlations, avoiding full 3D cost volumes. Key contributions include a memory-efficient sparse-to-dense correlation scheme, hull-guided initial disparity and weak priors during refinement, and a memory-friendly training approach enabling high-resolution learning on synthetic Objaverse-XL data. The approach yields strong accuracy and robustness on high-resolution datasets while reducing memory usage, with practical impact for volumetric capture and real-time or near-real-time depth estimation in complex scenes.

Abstract

We present a stereo-matching method for depth estimation from high-resolution images using visual hulls as priors, and a memory-efficient technique for the correlation computation. Our method uses object masks extracted from supplementary views of the scene to guide the disparity estimation, effectively reducing the search space for matches. This approach is specifically tailored to stereo rigs in volumetric capture systems, where accurate depth plays a key role in the downstream reconstruction task. To enable training and regression at the high resolutions targeted by recent systems, our approach extends a sparse correlation computation into a hybrid sparse-dense scheme suitable for application in leading recurrent network architectures. We evaluate the performance-efficiency trade-off of our method compared to state-of-the-art methods, and demonstrate the efficacy of the visual hull guidance. In addition, we propose a training scheme for a further reduction of memory requirements during optimization, facilitating training on high-resolution data.
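
To make the search-space reduction concrete, the following is a minimal sketch of matching along a single rectified scanline where each pixel only considers disparities inside a given interval. It is an illustration under stated assumptions, not the network described in the paper: the function name `bounded_disparity`, the toy unit-vector features, and the per-pixel integer bounds `b_min` and `b_max` are all hypothetical, and a plain dot product stands in for learned feature correlation.

```python
import math


def bounded_disparity(left, right, b_min, b_max):
    """For each left pixel x, return the integer disparity d in
    [b_min[x], b_max[x]] that maximises the dot product between
    left[x] and right[x - d]. Returns None where no candidate
    falls inside the image. (Hypothetical helper for illustration.)"""
    result = []
    for x, feat in enumerate(left):
        best_d, best_score = None, -math.inf
        for d in range(b_min[x], b_max[x] + 1):
            xr = x - d  # matching column in the right image
            if xr < 0 or xr >= len(right):
                continue
            score = sum(a * b for a, b in zip(feat, right[xr]))
            if score > best_score:
                best_d, best_score = d, score
        result.append(best_d)
    return result


def unit(theta):
    """Toy 2D unit-length feature; identical features give a dot product of 1."""
    return (math.cos(theta), math.sin(theta))


# Toy scanline: the right row is the left row shifted by a disparity of 2.
width, true_d = 8, 2
left = [unit(0.3 * x) for x in range(width)]
right = [unit(0.3 * (x + true_d)) for x in range(width)]

# Assumed bounds: every pixel may only search disparities 1..3.
b_min = [1] * width
b_max = [3] * width
disparities = bounded_disparity(left, right, b_min, b_max)
```

With these bounds each pixel evaluates at most three candidates instead of the full row, which is the kind of saving a tighter interval provides; pixels whose candidates all fall outside the image are left unmatched.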

Paper Structure (21 sections, 8 equations, 8 figures, 4 tables)

This paper contains 21 sections, 8 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: We propose a technique to induce a rough shape estimate from object masks (top) as prior information to a novel, sparse-dense stereo-matching network (bottom) for the application in capture stages (left) for accurate and memory-efficient disparity estimation (right).
  • Figure 2: Overview of the three stages of our disparity estimation network VHS. Following the Feature Extraction we compute an Initial Disparity estimate $D_0$ from a sparse $k$NN cost volume restricted by the visual hull. Next, we perform an Iterative Refinement of the disparity guided by the visual hull prior using ConvGRU modules and dense local correlations with window size $k'$.
  • Figure 3: Estimation of the disparity boundaries $(b_p^{min}, b_p^{max})$ from two rectified views of an object's visual hull. The visual hull encloses the object's surface, so the surface is guaranteed to lie within the disparity boundaries.
  • Figure 4: Sample from the FlyingObjaverse training dataset. Notice how the true disparity is close to the upper disparity limit except for the basin in the bottom right, which cannot be recovered from the visual hull.
  • Figure 5: Memory-efficient training scheme for $n=2$ consecutive update steps. After the computation of the losses $\mathcal{L}_i$ and $\mathcal{L}_{i+1}$, we perform backpropagation to accumulate gradients of the update network parameters and detach the hidden state, effectively freeing the computational graph. $\sum \Delta$ indicates an optional accumulation of gradients to avoid multiple backward passes through the feature extraction network.
  • ...and 3 more figures
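
The disparity boundaries in the Figure 3 caption can be illustrated with the standard rectified pinhole relation d = f · B / z, where f is the focal length in pixels, B the baseline, and z the depth. The sketch below assumes that the visual hull yields a nearest and a farthest depth along each pixel ray; the function name `disparity_bounds` and its parameters are hypothetical, and the paper may derive the bounds differently (the caption mentions two rectified views of the hull).

```python
def disparity_bounds(z_near, z_far, focal_px, baseline):
    """Map a depth interval [z_near, z_far] along a pixel ray to a
    disparity interval (b_min, b_max) for a rectified stereo pair,
    using d = focal_px * baseline / z. (Hypothetical helper.)"""
    if not 0 < z_near <= z_far:
        raise ValueError("expected 0 < z_near <= z_far")
    b_min = focal_px * baseline / z_far   # far surface -> small disparity
    b_max = focal_px * baseline / z_near  # near surface -> large disparity
    return b_min, b_max


# Example values (assumed): hull spans depths 2 to 4 units,
# focal length 1000 px, baseline 0.25 units.
b_min, b_max = disparity_bounds(2.0, 4.0, focal_px=1000.0, baseline=0.25)
```

Because disparity is inversely proportional to depth, the near side of the hull gives the upper bound and the far side the lower bound; any surface enclosed by the hull then has a disparity inside this interval.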