ICG-MVSNet: Learning Intra-view and Cross-view Relationships for Guidance in Multi-View Stereo

Yuxi Hu, Jun Zhang, Zhe Zhang, Rafael Weilharter, Yuchen Rao, Kuangyi Chen, Runze Yuan, Friedrich Fraundorfer

TL;DR

ICG-MVSNet tackles depth estimation in multi-view stereo by explicitly leveraging geometric information within a single view and across views. It introduces Intra-View Fusion (IVF) to encode coordinate dependencies in a lightweight manner and Cross-View Aggregation (CVA) to propagate contextual priors across stages and depth hypotheses, within a coarse-to-fine 4-stage framework. A compact 3D-CNN regularizer yields a probability volume over depth hypotheses, optimized by a pixel-wise cross-entropy loss across stages. On DTU and Tanks and Temples, the method achieves competitive or superior accuracy and completeness while using less memory and running inference faster than many peers, highlighting practical efficiency gains for 3D reconstruction tasks.
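The pixel-wise cross-entropy objective mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the equal stage weights, the validity mask, and the toy resolutions and hypothesis counts are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits, axis=0):
    """Turn regularizer scores into probabilities along the depth axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def pixelwise_cross_entropy(prob, gt_index, mask):
    """Mean cross-entropy over valid pixels.

    prob:     (D, H, W) probability volume, sums to 1 over D
    gt_index: (H, W) integer index of the ground-truth depth hypothesis
    mask:     (H, W) boolean, True where ground truth is valid
    """
    # Probability assigned to the ground-truth hypothesis at each pixel.
    p_gt = np.take_along_axis(prob, gt_index[None], axis=0)[0]
    loss = -np.log(np.clip(p_gt, 1e-12, None))
    return float(loss[mask].mean())


def multi_stage_loss(probs, gt_indices, masks, weights=None):
    """Weighted sum of per-stage losses (equal weights are an assumption)."""
    if weights is None:
        weights = [1.0] * len(probs)
    return sum(
        w * pixelwise_cross_entropy(p, g, m)
        for w, p, g, m in zip(weights, probs, gt_indices, masks)
    )


# Toy usage: 4 stages with hypothetical (D, H, W) sizes, coarse to fine.
rng = np.random.default_rng(0)
shapes = [(8, 4, 4), (4, 8, 8), (2, 16, 16), (2, 32, 32)]
probs = [softmax(rng.normal(size=s)) for s in shapes]
gts = [rng.integers(0, s[0], size=s[1:]) for s in shapes]
masks = [np.ones(s[1:], dtype=bool) for s in shapes]
total = multi_stage_loss(probs, gts, masks)
```

Treating depth estimation as classification over hypotheses is what makes cross-entropy applicable here; a regression-style loss would instead compare an expected depth against the ground truth.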

Abstract

Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks overlook the geometric information embedded in features and correlations, leading to weak cost matching. In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we develop an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to enhance robust cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently utilizes the contextual information from volume correlations to guide regularization. Our method is evaluated on the DTU dataset and Tanks and Temples benchmark, consistently achieving competitive performance against state-of-the-art works, while requiring lower computational resources.

Paper Structure

This paper contains 17 sections, 12 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison with state-of-the-art methods in runtime and GPU memory consumption on DTU. Our method achieves state-of-the-art performance while maintaining efficient inference time and low memory usage.
  • Figure 2: The overall architecture. Our method is a coarse-to-fine framework that estimates depths from low resolution (stage $\ell$) to high resolution (stage $\ell+1$), where $\ell = 0, 1, 2$, resulting in a total of $4$ stages. Features of reference and source images $\{\bm{F}_{i}\}_{i=0}^{N}$ are extracted by a feature pyramid network with the help of Intra-View Fusion (IVF), whose details are illustrated in (a). The source image features are warped into the $D$ frustum planes of the reference camera and an element-wise multiplication is used to correlate each source image with the reference image. These correlations are aggregated into a single cost volume $\bm{C}$. In finer stages (stages $1$, $2$, and $3$), both current and previous stage correlations are used in Cross-View Aggregation (CVA), whereas in stage $0$, the cost volume is not updated due to the absence of contextual correlations from a previous stage. Details of this process are illustrated in (b) and (c). Regularization (3D CNN) yields the probability volume $\bm{P}$, from which the depth hypothesis with the highest probability is selected for the final depth map. Depth maps from multiple viewpoints are fused into a point cloud in a non-learnable process.
  • Figure 3: Qualitative comparison with other methods on the DTU dataset. The depth map estimated by our method has a more complete and continuous surface and also has clearer outlines at the edges.
  • Figure 4: Qualitative comparison with other methods on DTU. The depth map estimated by our method has a more complete and continuous surface and also has clearer outlines at the edges.
  • Figure 5: Point cloud error comparison of state-of-the-art methods on the Tanks and Temples dataset. $\tau$ is the officially determined, scene-relevant distance threshold, and darker means larger error. The first row shows Precision and the second row shows Recall. Taking the Horse in the intermediate subset as an example, our method is able to reduce large amounts of outliers while ensuring completeness.
  • ...and 2 more figures
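
The correlation and depth-selection steps described in the Figure 2 caption can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the paper's code: source features are assumed to be already warped onto the reference frustum planes, the per-view correlations are combined by a plain average (the caption does not say how they are aggregated), and the learned 3D CNN is replaced by a placeholder. The names `build_cost_volume` and `select_depth` are hypothetical.

```python
import numpy as np


def build_cost_volume(ref_feat, warped_src_feats):
    """Correlate each warped source view with the reference view.

    ref_feat:         (C, H, W) reference features
    warped_src_feats: (N, C, D, H, W) source features on D frustum planes
    returns:          (C, D, H, W) aggregated cost volume
    """
    # Element-wise multiplication, broadcasting the reference over views and depths.
    correlations = warped_src_feats * ref_feat[None, :, None, :, :]
    # Assumption: a plain average over the N source views.
    return correlations.mean(axis=0)


def select_depth(prob_volume, depth_hypotheses):
    """Pick, per pixel, the depth hypothesis with the highest probability.

    prob_volume:      (D, H, W)
    depth_hypotheses: (D, H, W)
    returns:          (H, W) depth map
    """
    best = prob_volume.argmax(axis=0)
    return np.take_along_axis(depth_hypotheses, best[None], axis=0)[0]


# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
N, C, D, H, W = 2, 4, 8, 5, 6
ref = rng.normal(size=(C, H, W))
warped = rng.normal(size=(N, C, D, H, W))
cost = build_cost_volume(ref, warped)  # (C, D, H, W)

# Placeholder for the learned 3D CNN: sum channels, then softmax over depth.
scores = cost.sum(axis=0)
prob = np.exp(scores - scores.max(axis=0, keepdims=True))
prob /= prob.sum(axis=0, keepdims=True)

depths = np.broadcast_to(np.linspace(1.0, 2.0, D)[:, None, None], (D, H, W))
depth_map = select_depth(prob, depths)  # (H, W)
```

Selecting the single most probable hypothesis pairs naturally with a classification-style loss; the homography warping, IVF, and CVA modules are deliberately left out of this sketch.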