CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer
Sicheng Wang, Hao Jiang, Lei Xiang
TL;DR
CT-MVSNet addresses the scalability issue of transformer-based MVS by introducing a cross-scale transformer built around an adaptive matching-aware transformer (AMT), which interleaves intra- and inter-attention blocks across feature pyramid scales to capture both intra-image context and inter-view relationships. It further strengthens depth reconstruction with dual-feature guided aggregation (DFGA), which injects coarse global semantics into finer cost volumes, and a feature metric loss (FM Loss), which penalizes cross-view feature mismatch. The method achieves state-of-the-art results on DTU and Tanks & Temples while preserving efficiency through linear-attention mechanisms and scale-aware attention allocation. Overall, CT-MVSNet advances dense 3D reconstruction by unifying cross-scale context, global-semantic guidance, and feature-consistency penalties, and its code is available for reproduction.
Abstract
Recent deep multi-view stereo (MVS) methods have widely incorporated transformers into cascade networks for high-resolution depth estimation, achieving impressive results. However, existing transformer-based methods are constrained by their computational cost, which prevents their extension to finer stages. In this paper, we propose a novel cross-scale transformer (CT) that processes feature representations at different stages without additional computation. Specifically, we introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales. This combined strategy enables our network to capture intra-image context information and enhance inter-image feature relationships. In addition, we present a dual-feature guided aggregation (DFGA) that embeds coarse global semantic information into the finer cost volume construction to further strengthen global and local feature awareness. We also design a feature metric loss (FM Loss) that evaluates the feature bias before and after transformation to reduce the impact of feature mismatch on depth estimation. Extensive experiments on the DTU dataset and the Tanks and Temples (T&T) benchmark demonstrate that our method achieves state-of-the-art results. Code is available at https://github.com/wscstrive/CT-MVSNet.
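
To make the idea of "different interactive attention combinations at multiple scales" more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear attention with an elu(x) + 1 feature map, and it applies a different mix of intra-attention (within one image) and inter-attention (source attending to reference) at each stage. The names `linear_attention`, `run_stage`, and `STAGE_SCHEDULE`, the block counts per stage, and the choice to update only the source tokens in inter-attention are all illustrative assumptions; the actual configuration is in the released code.

```python
import math


def feature_map(x):
    # elu(x) + 1: a strictly positive feature map commonly used for linear attention.
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries, keys, values):
    """Attend N query tokens (N x d) to M key/value tokens (M x d and M x dv)."""
    d, dv = len(keys[0]), len(values[0])
    phi_k = [[feature_map(x) for x in k] for k in keys]
    # Summarise keys and values once, so the cost grows linearly with token count.
    kv = [[sum(pk[i] * v[j] for pk, v in zip(phi_k, values)) for j in range(dv)]
          for i in range(d)]
    k_sum = [sum(pk[i] for pk in phi_k) for i in range(d)]
    out = []
    for q in queries:
        phi_q = [feature_map(x) for x in q]
        norm = sum(a * b for a, b in zip(phi_q, k_sum))  # positive by construction
        out.append([sum(phi_q[i] * kv[i][j] for i in range(d)) / norm
                    for j in range(dv)])
    return out


def add_residual(tokens, update):
    # Element-wise residual connection: tokens + update.
    return [[t + u for t, u in zip(t_row, u_row)]
            for t_row, u_row in zip(tokens, update)]


# Hypothetical schedule: more attention blocks at the coarse stage, fewer at finer ones.
STAGE_SCHEDULE = {
    "coarse": ["intra", "inter", "intra", "inter"],
    "medium": ["intra", "inter"],
    "fine": ["inter"],
}


def run_stage(stage, ref, src):
    """Apply the stage's attention blocks to reference and source tokens."""
    for kind in STAGE_SCHEDULE[stage]:
        if kind == "intra":
            # Intra-attention: each image attends to its own tokens.
            ref = add_residual(ref, linear_attention(ref, ref, ref))
            src = add_residual(src, linear_attention(src, src, src))
        else:
            # Inter-attention (assumption): source tokens query the reference tokens.
            src = add_residual(src, linear_attention(src, ref, ref))
    return ref, src


# Toy example: three 2-dimensional tokens per image.
ref_tokens = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
src_tokens = [[0.2, 0.1], [-0.3, 0.4], [0.6, 0.5]]
ref_out, src_out = run_stage("coarse", ref_tokens, src_tokens)
```

Because the key/value summary is computed once per call, the cost of each block scales linearly with the number of tokens rather than quadratically, which is the property that would make attention at finer, higher-resolution stages affordable.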
