CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer

Sicheng Wang, Hao Jiang, Lei Xiang

TL;DR

CT-MVSNet addresses the scalability issue of transformer-based MVS by introducing a cross-scale transformer built around an adaptive matching-aware transformer (AMT), which interleaves intra- and inter-attention blocks across feature pyramid scales to capture both intra-image context and inter-view relationships. It further strengthens depth reconstruction with dual-feature guided aggregation (DFGA), which injects coarse global semantics into finer cost volumes, and a feature metric loss (FM Loss), which penalizes feature bias before and after transformation to reduce the impact of feature mismatch. The method achieves state-of-the-art results on DTU and Tanks & Temples while preserving efficiency through linear-attention mechanisms and scale-aware attention allocation. Overall, CT-MVSNet advances dense 3D reconstruction by unifying cross-scale context, global-semantic guidance, and feature-consistency penalties, with code available for reproduction.

Abstract

Recent deep multi-view stereo (MVS) methods have widely incorporated transformers into cascade networks for high-resolution depth estimation, achieving impressive results. However, existing transformer-based methods are constrained by their computational costs, which prevents their extension to finer stages. In this paper, we propose a novel cross-scale transformer (CT) that processes feature representations at different stages without additional computation. Specifically, we introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales. This combined strategy enables our network to capture intra-image context information and enhance inter-image feature relationships. In addition, we present a dual-feature guided aggregation (DFGA) that embeds coarse global semantic information into the finer cost volume construction to further strengthen global and local feature awareness. We also design a feature metric loss (FM Loss) that evaluates the feature bias before and after transformation to reduce the impact of feature mismatch on depth estimation. Extensive experiments on the DTU dataset and the Tanks and Temples (T&T) benchmark demonstrate that our method achieves state-of-the-art results. Code is available at https://github.com/wscstrive/CT-MVSNet.
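
The distinction between intra-image and inter-image attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes plain softmax attention with identity projections (no learned weights and not the linear attention mentioned in the TL;DR), and the function names `attention`, `intra_attention`, and `inter_attention` are hypothetical. Features are assumed to be flattened to shape (pixels, channels).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_feat, kv_feat):
    """Scaled dot-product attention with identity projections (assumption)."""
    d = q_feat.shape[-1]
    weights = softmax(q_feat @ kv_feat.T / np.sqrt(d))
    return weights @ kv_feat, weights

def intra_attention(feat):
    # Q, K, V all come from the same image: intra-image context.
    return attention(feat, feat)

def inter_attention(feat, other_feat):
    # Q comes from one image, K' and V' from another: inter-image relations.
    return attention(feat, other_feat)

rng = np.random.default_rng(0)
ref = rng.standard_normal((6, 8))  # 6 flattened pixels, 8 channels (illustrative)
src = rng.standard_normal((5, 8))  # a second view with 5 flattened pixels

ref_intra, w_intra = intra_attention(ref)
ref_inter, w_inter = inter_attention(ref, src)
print(ref_intra.shape, w_intra.shape, w_inter.shape)
```

In both cases the output keeps the shape of the query features; only the source of the keys and values changes, which is what lets the two block types be interleaved.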

Paper Structure (25 sections, 11 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of our CT-MVSNet. An FPN first extracts multi-scale features, and AMT then applies different attention combinations to enhance the features at each stage (Sect. 3.2). W&F is used to construct the cost volume under the depth hypotheses, and DFGA is added in the finer stages to further refine cost volume construction (Sect. 3.3). A 3D U-Net produces a probability volume, and winner-take-all (WTA) is then used to obtain a depth map (Sect. 3.3). The CE Loss with one-hot labels is applied to the probability volume, and the FM Loss with a feature metric is applied to the depth map (Sect. 3.4). The depth hypotheses are then updated by up-sampling to the finer stages, yielding coarse-to-fine depth maps.
  • Figure 2: Illustration of AMT. (a) Interleaved combination of intra- and inter-attention at each stage. (b) Internal architecture of the two attentions: each stage consists of interleaved intra-attention (w.r.t. $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$) within images and inter-attention (w.r.t. $\mathbf{Q}$, $\mathbf{K}'$, $\mathbf{V}'$) across images.
  • Figure 3: Illustration of DFGA. Our DFGA is applied only in the second and third stages.
  • Figure 4: Point cloud error comparison of state-of-the-art methods on the DTU dataset. The first and third rows are estimated depth maps, while the others are point cloud reconstruction results.
  • Figure 5: Point cloud error comparison of state-of-the-art methods on the Tanks and Temples dataset. $\tau$ is the officially determined scene-relevant distance threshold; darker means larger errors.
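
The winner-take-all (WTA) step named in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the probability volume is assumed to have shape (D, H, W) with one shared set of D depth hypotheses, the function name `wta_depth` is hypothetical, and the numeric values are made up for the example.

```python
import numpy as np

def wta_depth(prob_volume, depth_hypotheses):
    """Pick, per pixel, the depth hypothesis with the highest probability.

    prob_volume:      array of shape (D, H, W)
    depth_hypotheses: array of shape (D,), shared by all pixels (assumption)
    """
    idx = prob_volume.argmax(axis=0)   # (H, W) index of the winning hypothesis
    return depth_hypotheses[idx], idx  # (H, W) depth map and winning indices

# Toy volume with D=3 hypotheses on a 2x2 image (illustrative values only).
prob = np.array([
    [[0.1, 0.7], [0.2, 0.2]],
    [[0.6, 0.2], [0.3, 0.1]],
    [[0.3, 0.1], [0.5, 0.7]],
])
hyps = np.array([425.0, 680.0, 935.0])

depth, idx = wta_depth(prob, hyps)
print(depth)
```

In a coarse-to-fine scheme such as the one the caption describes, a depth map like this would be up-sampled and used to centre the narrower hypothesis range of the next stage.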