Boosting Multi-view Stereo with Late Cost Aggregation

Jiang Wu, Rui Li, Yu Zhu, Wenxun Zhao, Jinqiu Sun, Yanning Zhang

TL;DR

The paper tackles the problem that early aggregation of pairwise MVS costs can erode informative depth cues. It introduces a late cost aggregation framework that preserves per-view costs along a view channel, so the depth network can exploit geometric cues throughout the forward pass with only minor changes to the CasMVSNet pipeline. Supporting techniques include a view-shuffle strategy to mitigate view-order effects, mechanisms for handling variable numbers of testing views, and improved multi-view consistency filtering. Empirical results on DTU, Tanks & Temples, and ETH3D show performance competitive with state-of-the-art methods while using far fewer parameters and offering favorable computation overhead.

Abstract

Pairwise matching cost aggregation is a crucial step for modern learning-based Multi-view Stereo (MVS). Prior works adopt an early aggregation scheme, which adds up pairwise costs into an intermediate cost. However, our analysis shows that this process can degrade informative pairwise matchings, thereby blocking the depth network from fully utilizing the original geometric matching cues. To address this challenge, we present a late aggregation approach that aggregates pairwise costs throughout the network feed-forward process, achieving accurate estimations with only minor changes to the plain CasMVSNet. Instead of building an intermediate cost by weighted sum, late aggregation preserves all pairwise costs along a distinct view channel. This enables the succeeding depth network to fully utilize the crucial geometric cues without loss of cost fidelity. Grounded in the new aggregation scheme, we propose further techniques that address view order dependence inside the preserved cost, handle flexible testing views, and improve the depth filtering process. Despite its technical simplicity, our method improves significantly upon the baseline cascade-based approach, achieving results comparable to state-of-the-art methods with favorable computation overhead.
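The contrast between the two schemes in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the tensor layout `(V, C, D, H, W)`, the function names, and the choice to fold the view axis into the channel axis are all assumptions, and the per-cost pre-regularization convolution is omitted.

```python
import numpy as np

def early_aggregate(pairwise_costs, weights):
    """Early aggregation: collapse V pairwise costs into one intermediate cost.

    pairwise_costs: array of shape (V, C, D, H, W), an assumed layout.
    weights: per-view weights broadcastable to pairwise_costs.
    Returns an array of shape (C, D, H, W); per-view detail is lost here.
    """
    weights = weights / weights.sum(axis=0, keepdims=True)  # normalize over views
    return (weights * pairwise_costs).sum(axis=0)

def late_aggregate(pairwise_costs, rng=None):
    """Late aggregation: keep every pairwise cost along a view channel.

    Returns an array of shape (V * C, D, H, W). If rng is given, the view
    order is shuffled first (a stand-in for the view-shuffle idea).
    """
    v, c, d, h, w = pairwise_costs.shape
    if rng is not None:
        pairwise_costs = pairwise_costs[rng.permutation(v)]
    return pairwise_costs.reshape(v * c, d, h, w)

# Hypothetical sizes: 3 source views, 2 channels, 4 depth hypotheses, 2x2 pixels.
costs = np.arange(3 * 2 * 4 * 2 * 2, dtype=np.float64).reshape(3, 2, 4, 2, 2)
uniform = np.ones((3, 1, 1, 1, 1))
early = early_aggregate(costs, uniform)
late = late_aggregate(costs)
shuffled = late_aggregate(costs, rng=np.random.default_rng(0))
```

With early aggregation the downstream network only ever sees the `(C, D, H, W)` blend, whereas the late variant hands it all `V` costs, so the decision about which view to trust is deferred to the network itself.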

Paper Structure (18 sections, 3 equations, 6 figures, 7 tables)


Figures (6)

  • Figure 1: Comparison between early and late aggregation. (a) We compute the preservation ratio, i.e., the proportion of pixels whose latent faithful depths are preserved by the final depth predictions. For the early aggregation method [xu2022learning] (Early aggre.), the preservation ratio declines after aggregation (blue bar), resulting in inferior network predictions (purple bar). In contrast, our method (late aggre.) effectively attains more informative matching costs via late aggregation. (b) As a result, our method achieves higher depth accuracy (acc. $<$ 2 mm, the percentage of pixels with depth error within 2 mm) than the early aggregation method.
  • Figure 2: Limitations of early aggregation. We showcase the early aggregation process for a pixel in PVSNet [xu2022learning]. There are three pairwise matching costs, one of which is informative and carries the most faithful depth cues (pink background). Although the weight module manages to assign the highest weight to the informative cost, the result is still a suboptimal intermediate cost (gray background) in which the informative cost is compromised by non-informative ones, because the weights are not sufficiently distinct. As a result, the depth network cannot utilize the initial informative cost, leading to inferior depth predictions (blue background).
  • Figure 3: Late cost aggregation with view-preserving. For multiple pairwise costs, we apply a single convolutional layer separately to each cost for pre-regularization. Then, we preserve the pairwise costs along a view-based channel to build view-preserved costs for the depth network. To remove the view order dependence that comes with late aggregation, we introduce a view shuffle scheme that disrupts the view order.
  • Figure 4: Cost reconstruction for handling flexible numbers of testing views. (a) With more testing views, we select the most useful pairwise costs and iterate over the remaining ones to reconstruct multiple costs that each fit the initial cost shape. The final depth is obtained with the winner-takes-all strategy. (b) With fewer testing views, we keep all pairwise costs and repeatedly duplicate the most useful one until the cost fits the desired shape.
  • Figure 5: Reconstruction of scan 13 on DTU. Our method achieves faithful reconstruction results, especially in textureless areas with high reflection. Note that (d) MVSTER uses an original resolution of 1600×1200 for network training, while other methods are trained on low-resolution images.
  • ...and 1 more figures
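
The cost reconstruction described in the Figure 4 caption could look roughly like the sketch below. It is a loose interpretation under stated assumptions, not the paper's procedure: the per-view usefulness `scores`, the parameter `n_train` (number of source views the network expects), the rule of fixing the top `n_train - 1` views and rotating each leftover view into the last slot, and the confidence-based winner-takes-all are all hypothetical choices made for illustration.

```python
import numpy as np

def reconstruct_view_groups(scores, n_train):
    """Build lists of view indices, each of length n_train.

    scores: hypothetical per-view usefulness (higher is more useful).
    n_train: number of source views the network was trained with (assumed).
    """
    order = [int(i) for i in np.argsort(scores)[::-1]]  # most useful first
    n_test = len(order)
    if n_test == n_train:
        return [order]
    if n_test < n_train:
        # Fewer views: keep all, pad by repeating the most useful view.
        return [order + [order[0]] * (n_train - n_test)]
    # More views: keep the top n_train - 1, then pair them with each leftover view.
    kept, rest = order[: n_train - 1], order[n_train - 1 :]
    return [kept + [r] for r in rest]

def winner_takes_all(depths, confidences):
    """Pick, per pixel, the depth from the most confident group.

    depths, confidences: arrays of shape (G, H, W), one slice per group.
    """
    best = np.argmax(confidences, axis=0)  # (H, W) index of winning group
    return np.take_along_axis(depths, best[None], axis=0)[0]

fewer = reconstruct_view_groups(np.array([0.9, 0.1, 0.5]), n_train=5)
more = reconstruct_view_groups(np.array([0.9, 0.1, 0.5, 0.7, 0.3]), n_train=3)
depths = np.array([[[1.0, 2.0]], [[3.0, 4.0]]])
conf = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])
fused = winner_takes_all(depths, conf)
```

Each returned group has exactly `n_train` entries, so the stacked cost always matches the shape the depth network saw during training, whichever way the number of testing views differs.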