CL-MVSNet: Unsupervised Multi-view Stereo with Dual-level Contrastive Learning

Kaiqiang Xiong, Rui Peng, Zhe Zhang, Tianxing Feng, Jianbo Jiao, Feng Gao, Ronggang Wang

TL;DR

This paper tackles unsupervised multi-view stereo (MVS), where relying solely on photometric consistency often fails in indistinguishable regions and under view-dependent effects. It introduces CL-MVSNet, which integrates dual-level contrastive learning—image-level and scene-level—alongside an $L_{0.5}$ photometric consistency loss to improve context usage and robustness. The approach achieves state-of-the-art results among end-to-end unsupervised methods on DTU and Tanks&Temples, and even surpasses certain supervised baselines without fine-tuning, demonstrating strong generalization and reconstruction quality. These advances reduce dependence on ground-truth 3D data and enhance 3D reconstruction in challenging scenes, with some remaining limitations at object edges.

Abstract

Unsupervised Multi-View Stereo (MVS) methods have achieved promising progress recently. However, previous methods primarily depend on the photometric consistency assumption, which may suffer from two limitations: indistinguishable regions and view-dependent effects, e.g., low-textured areas and reflections. To address these issues, in this paper, we propose a new dual-level contrastive learning approach, named CL-MVSNet. Specifically, our model integrates two contrastive branches into an unsupervised MVS framework to construct additional supervisory signals. On the one hand, we present an image-level contrastive branch to guide the model to acquire more context awareness, thus leading to more complete depth estimation in indistinguishable regions. On the other hand, we exploit a scene-level contrastive branch to boost the representation ability, improving robustness to view-dependent effects. Moreover, to recover more accurate 3D geometry, we introduce an $L_{0.5}$ photometric consistency loss, which encourages the model to focus more on accurate points while mitigating the gradient penalty of undesirable ones. Extensive experiments on DTU and Tanks&Temples benchmarks demonstrate that our approach achieves state-of-the-art performance among all end-to-end unsupervised MVS frameworks and outperforms its supervised counterpart by a considerable margin without fine-tuning.
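The abstract's remark about gradients can be illustrated with a generic $L_p$ penalty with $p = 0.5$. The sketch below is not the paper's exact loss, which is not given in this summary: the function names, the `eps` stabiliser, and the assumption that the mask has the same shape as the images are all hypothetical choices made for illustration. It shows only that, for $p < 1$, the gradient magnitude with respect to the residual shrinks as the residual grows, so pixels with large photometric error contribute less gradient than pixels that already match well.

```python
import numpy as np


def lp_photometric_loss(reference, reconstructed, mask=None, p=0.5, eps=1e-6):
    """Mean of (|residual| + eps)^p over valid pixels.

    Hypothetical helper: p=0.5 gives an L0.5-style penalty. `mask` is
    assumed to have the same shape as the images (1 = valid pixel).
    """
    residual = np.abs(reference - reconstructed)
    per_pixel = (residual + eps) ** p
    if mask is None:
        return float(per_pixel.mean())
    mask = mask.astype(per_pixel.dtype)
    # Guard against an all-zero mask to avoid division by zero.
    return float((per_pixel * mask).sum() / max(mask.sum(), 1.0))


def lp_gradient_magnitude(residual, p=0.5, eps=1e-6):
    """|d/dr (|r| + eps)^p| = p * (|r| + eps)^(p - 1)."""
    return p * (np.abs(residual) + eps) ** (p - 1.0)


if __name__ == "__main__":
    for r in (0.01, 0.1, 0.5):
        print(r, lp_gradient_magnitude(r, p=0.5), lp_gradient_magnitude(r, p=1.0))
```

With `p=1.0` the gradient magnitude is the same for every residual, whereas with `p=0.5` it is largest for small residuals, which is one way to read "focus more on accurate points".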

Paper Structure

This paper contains 19 sections, 12 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Qualitative comparison of reconstruction quality with the SOTA method chang2022rc on scan29 of DTU aanaes2016large. Our method performs better on repetitive patterns.
  • Figure 2: The framework of CL-MVSNet. The framework consists of: (a) a Regular Branch with a regular sample similar to CasMVSNet gu2020cascade, (b) an Image-level Contrastive Branch with the image-level contrastive sample, and (c) a Scene-level Contrastive Branch with the scene-level contrastive sample. To pull positive pairs close, we enforce contrastive consistency between the regular branch and the two contrastive branches, with the confidence mask estimated from the regular branch. Moreover, our proposed $\mathcal{L}_{0.5}$ photometric consistency is enforced between the reconstructed images and the input reference image on the regular branch for more accurate reconstruction.
  • Figure 3: Photometric consistency. The source images are warped to reconstruct the reference image with the inferred depth map on the reference view. Then consistency is enforced between the reconstructed images and the reference image.
  • Figure 4: Image-level contrastive sample. (a) a source image of the regular sample. (b) a source image of the image-level contrastive sample. A Bernoulli-distributed binary mask is used to simulate the failure case of local photometric consistency in (b).
  • Figure 5: View-dependent effects and occlusions. From source view 4, point $A$ is occluded and only point $A'$ is visible along the line of sight. In addition, the appearance of the identical region $B$ differs across views due to variations in illumination, camera exposure, and reflections.
  • ...and 6 more figures
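
The Bernoulli-distributed binary mask mentioned in the Figure 4 caption can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `make_image_level_sample`, the per-pixel granularity of the mask, and the `keep_prob` value are all assumptions, since the caption states only that a Bernoulli-distributed binary mask is applied to a source image.

```python
import numpy as np


def make_image_level_sample(source, keep_prob=0.9, rng=None):
    """Zero out pixels of a source image with a per-pixel Bernoulli mask.

    Hypothetical helper: each pixel is kept independently with probability
    `keep_prob` (an assumed value) and set to zero otherwise.
    Returns the masked image and the boolean keep-mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = source.shape[:2]
    mask = rng.random((h, w)) < keep_prob  # True = keep the pixel
    if source.ndim == 3:
        # Broadcast the same mask over all colour channels.
        masked = source * mask[..., None]
    else:
        masked = source * mask
    return masked, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.ones((64, 64, 3), dtype=np.float32)
    masked, mask = make_image_level_sample(image, keep_prob=0.9, rng=rng)
    print("fraction of pixels kept:", mask.mean())
```

Masked pixels carry no local appearance cue, so a depth estimate that must stay consistent with the regular branch at those pixels has to rely on surrounding context.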