MVS-TTA: Test-Time Adaptation for Multi-View Stereo via Meta-Auxiliary Learning

Hannuo Zhang, Zhixiang Chi, Yang Wang, Xinxin Zuo

TL;DR

MVS-TTA addresses the limited generalization of learning-based multi-view stereo by introducing test-time adaptation guided by a self-supervised cross-view photometric consistency objective. A meta-auxiliary learning strategy trains models to benefit from lightweight adaptation at inference, enabling rapid scene-specific refinement without extra labels. The approach is model-agnostic and demonstrates consistent gains across DTU, BlendedMVS, and cross-dataset scenarios, with improved depth accuracy and robustness to domain shifts. This framework offers a practical pathway to bring optimization-style adaptability to data-driven MVS pipelines in real-world deployments, preserving efficiency while enhancing reconstruction fidelity.

Abstract

Recent learning-based multi-view stereo (MVS) methods are data-driven and have achieved remarkable progress due to large-scale training data and advanced architectures. However, their generalization remains sub-optimal due to fixed model parameters trained on limited training data distributions. In contrast, optimization-based methods enable scene-specific adaptation but lack scalability and require costly per-scene optimization. In this paper, we propose MVS-TTA, an efficient test-time adaptation (TTA) framework that enhances the adaptability of learning-based MVS methods by bridging these two paradigms. Specifically, MVS-TTA employs a self-supervised, cross-view consistency loss as an auxiliary task to guide inference-time adaptation. We introduce a meta-auxiliary learning strategy to train the model to benefit from auxiliary-task-based updates explicitly. Our framework is model-agnostic and can be applied to a wide range of MVS methods with minimal architectural changes. Extensive experiments on standard datasets (DTU, BlendedMVS) and a challenging cross-dataset generalization setting demonstrate that MVS-TTA consistently improves performance, even when applied to state-of-the-art MVS models. To our knowledge, this is the first attempt to integrate optimization-based test-time adaptation into learning-based MVS using meta-learning. The code will be available at https://github.com/mart87987-svg/MVS-TTA.
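
As a rough formalization, the adaptation scheme described above can be written as a bi-level update. This is a sketch in the style of standard gradient-based meta-learning, not the paper's exact equations. The step sizes $\alpha$ and $\beta$, the single inner step, and the function-argument notation are assumptions. The symbols $\theta$, $\phi_b$, $L_{photo}$, $L_{pri}$, $P_{i,b}$, and $D_{\mathrm{gt},b}$ follow the Figure 2 caption.

$$
\phi_b = \theta - \alpha \, \nabla_\theta L_{photo}\big(\theta;\ \{P_{i,b}\}_{i=1}^{M+1}\big),
\qquad
\theta \leftarrow \theta - \beta \, \nabla_\theta \sum_{b=1}^{B} L_{pri}\big(\phi_b;\ \{P_{i,b}\}_{i=1}^{M+1},\ D_{\mathrm{gt},b}\big)
$$

The first update needs no labels, so at test time it can be applied on its own to the meta-trained $\theta$. The second update needs ground-truth depth and is used only during training.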

Paper Structure

This paper contains 18 sections, 13 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Qualitative comparison on 3 textureless samples from the DTU dataset. From top to bottom: reference image, ground-truth depth, prediction by MVSFormer++ (baseline), and prediction by MVSFormer++ + MVS-TTA.
  • Figure 2: Overview of meta-auxiliary training in the proposed MVS-TTA framework. Given a batch of samples $\{\{P_{i,b}\}_{i=1}^{M+1}, D_{\mathrm{gt},b}\}_{b=1}^{B}$, the meta-training process follows a nested loop structure. In the inner loop, for each sample $(\{P_{i,b}\}_{i=1}^{M+1}, D_{\mathrm{gt},b})$, we first adapt the model parameters $\theta$ for a few steps using the photometric consistency loss $L_{photo}$ as an auxiliary task. Then, in the outer loop, the adapted model $\phi_b$ performs the primary task of depth inference: the primary loss $L_{pri}$, which measures the discrepancy between the predicted depth map and the ground-truth annotation, is computed and used to update the original model parameters $\theta$.
  • Figure 3: Overview of the test-time adaptation procedure. Given a test sample, we adapt the meta-trained model for a few steps using the auxiliary task of photometric consistency loss. The adapted model is then used to infer the depth map for that sample.
  • Figure 4: Qualitative comparison between different baselines and our MVS-TTA framework. Each column corresponds to a different MVS model and dataset combination: the first two columns show results on the DTU dataset, and the last column is from BlendedMVS. From top to bottom: reference image, ground-truth depth map, depth prediction from the baseline model, and depth prediction after applying MVS-TTA. Our method improves the accuracy of depth prediction across different models and datasets.
  • Figure 5: Qualitative comparison of cross-dataset generalization. The first row shows results using TransMVSNet as backbone, and the second row presents results using CasMVSNet as backbone. From left to right: reference image, depth prediction using baseline model without any adaptation, and depth prediction with MVS-TTA applied. In the first row, the scattered erroneous predictions in front of the building façade are reduced, and in the second row, the incorrect depth around the upper-left region of the instrument is corrected.
  • ...and 2 more figures
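
The nested loop in the Figure 2 caption and the test-time procedure in the Figure 3 caption can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation. The scalar "model", the toy losses, every function name, and the learning rates are hypothetical. The outer update uses a first-order simplification (the primary-loss gradient is evaluated at the adapted parameters and applied directly to $\theta$), whereas a real pipeline would typically use an autograd framework and might differentiate through the inner steps.

```python
from typing import Callable, List, Tuple

Params = List[float]
Sample = Tuple[float, float]  # toy stand-in: (self-supervised cue, ground-truth depth)
GradFn = Callable[[Params, Sample], Params]


def sgd_step(params: Params, grads: Params, lr: float) -> Params:
    """One plain gradient-descent step."""
    return [p - lr * g for p, g in zip(params, grads)]


def adapt(theta: Params, sample: Sample, aux_grad: GradFn,
          inner_lr: float, inner_steps: int) -> Params:
    """Inner loop and test-time adaptation: a few steps on the label-free auxiliary loss."""
    phi = list(theta)  # copy, so theta itself is left unchanged
    for _ in range(inner_steps):
        phi = sgd_step(phi, aux_grad(phi, sample), inner_lr)
    return phi


def meta_train_step(theta: Params, batch: List[Sample], aux_grad: GradFn,
                    pri_grad: GradFn, inner_lr: float, outer_lr: float,
                    inner_steps: int) -> Params:
    """Outer loop (first-order simplification, an assumption of this sketch):
    average the primary-loss gradient taken at each adapted phi_b, then update theta."""
    total = [0.0] * len(theta)
    for sample in batch:
        phi_b = adapt(theta, sample, aux_grad, inner_lr, inner_steps)
        total = [t + g for t, g in zip(total, pri_grad(phi_b, sample))]
    mean_grad = [t / len(batch) for t in total]
    return sgd_step(theta, mean_grad, outer_lr)


# Hypothetical toy losses: the "model" is one scalar depth value, params[0].
def toy_aux_grad(params: Params, sample: Sample) -> Params:
    cue, _ = sample                        # ground truth is deliberately unused
    return [2.0 * (params[0] - cue)]       # gradient of (p - cue)^2


def toy_pri_grad(params: Params, sample: Sample) -> Params:
    _, depth_gt = sample
    return [2.0 * (params[0] - depth_gt)]  # gradient of (p - depth_gt)^2


def run_toy_demo(outer_iters: int = 50) -> Tuple[Params, Params]:
    """Meta-train on one toy sample, then adapt to it as done at test time."""
    batch = [(1.0, 1.5)]
    theta = [0.0]
    for _ in range(outer_iters):
        theta = meta_train_step(theta, batch, toy_aux_grad, toy_pri_grad,
                                inner_lr=0.25, outer_lr=0.5, inner_steps=1)
    phi = adapt(theta, batch[0], toy_aux_grad, inner_lr=0.25, inner_steps=1)
    return theta, phi
```

In this toy setting the auxiliary cue (1.0) and the ground truth (1.5) disagree, so minimizing the auxiliary loss alone would not reach the right answer. Meta-training instead moves the initial parameter toward 2.0, the value from which a single auxiliary step lands on 1.5. That is the sense in which the model is trained "to benefit from" the auxiliary update.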