
Learning 3D Reconstruction with Priors in Test Time

Lei Zhou, Haoyu Wu, Akshat Dave, Dimitris Samaras

Abstract

We introduce a test-time framework for multi-view Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve performance on 3D tasks without retraining or modifying pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss consists of a self-supervised objective and prior penalty terms. The self-supervised objective captures the compatibility among multi-view predictions and is implemented as a photometric or geometric loss between each view and renderings generated from the other views. Any available priors are converted into penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, our method reduces the point-map distance error by more than half compared with the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors into 3D vision tasks.
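To make the structure of the optimization concrete, the sketch below shows how such a test-time loss could be assembled from a self-supervised multi-view compatibility term plus penalty terms for whatever priors are available. This is an illustrative sketch under stated assumptions, not the paper's implementation; the tensor names, shapes, loss choices, and weights are all placeholders.

```python
# Minimal sketch (assumed names/shapes, not the authors' code) of a TCO-style
# test-time loss: self-supervised multi-view compatibility + prior penalties.
import torch
import torch.nn.functional as F

def tco_loss(pred, priors, weights=(1.0, 1.0, 1.0, 1.0)):
    """pred: dict of MVT outputs; priors: dict of available priors (may be empty)."""
    w_self, w_pose, w_K, w_depth = weights

    # Self-supervised objective: photometric error between each view's image
    # and renderings of that view produced from the other views' predictions.
    loss = w_self * F.l1_loss(pred["rendered_rgb"], pred["target_rgb"])

    # Prior penalty terms: applied only for modalities whose priors are given.
    if "pose" in priors:        # camera extrinsics, e.g. (V, 4, 4)
        loss = loss + w_pose * F.l1_loss(pred["pose"], priors["pose"])
    if "intrinsics" in priors:  # pinhole intrinsics, e.g. (V, 3, 3)
        loss = loss + w_K * F.l1_loss(pred["intrinsics"], priors["intrinsics"])
    if "depth" in priors:       # depth maps, e.g. (V, H, W)
        loss = loss + w_depth * F.l1_loss(pred["depth"], priors["depth"])
    return loss
```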



Figures (5)

  • Figure 1: Method Overview. (left) Multi-view Transformers (MVTs) take a set of RGB images as input and output depth maps, camera poses, and intrinsics. (middle) Given camera priors, MapAnything [keetha2025mapanything] and Pow3R [jang2025pow3r] feed them into the network as additional input modalities, which requires retraining a modified MVT. (right) Our method, Test-time Constrained Optimization (TCO), treats the priors as constraints on the MVT's predictions and optimizes the network with LoRA, using both prior penalty terms and a self-supervised objective, namely MVT prediction compatibility, at inference time (a schematic adaptation loop in this spirit is sketched after this figure list).
  • Figure 2: Qualitative Comparison. We compare TCO-VGGT with the base image-only model VGGT and the prior-aware feed-forward methods Pow3R and MapAnything. Overall, TCO-VGGT effectively corrects structural errors in image-only reconstructions by incorporating camera priors. Red, orange, and green circles highlight regions that are wrongly reconstructed, partially corrected, and correctly reconstructed, respectively. In the first row, TCO-VGGT corrects the inaccurate relative positions of two walls in the VGGT reconstruction. The same phenomenon is observed in the second and third rows, where the inaccurate scene structures are corrected. In the last row, TCO-VGGT reconstructs the hand as a whole, whereas VGGT reconstructs it as separate parts. MapAnything reconstructs the hand with blurry boundaries. Additional fine-grained qualitative comparisons with MapAnything and Pow3R are provided in the Supplementary Material.
  • Figure 3: Test-time scaling curve of our method on ETH3D.
  • Figure 4: Fine-grained Qualitative Results. We compare TCO-VGGT with prior-aware feed-forward methods, including Pow3R and MapAnything. In each grid cell, the predicted geometry is overlaid with the ground truth geometry, whose points are shown in green. Discrepancies between the predicted and ground-truth geometries are highlighted by red double arrows, whose lengths indicate the magnitude of the errors. TCO-VGGT exhibits much smaller discrepancies than MapAnything and Pow3R in both scene structure and boundary regions.
  • Figure 5: 2DGS Rendering Visualization. We visualize the 2DGS rendering process for one scene from 7-Scenes. As shown in the Rendered Image row, our 2DGS heuristic parameterization produces rendered images that closely match the ground-truth images. We also compare the depth and normal maps rendered from 2DGS with the corresponding ground-truth depth and normal maps, i.e., those predicted directly by the MVT for the same views.
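The test-time adaptation itself (Figure 1) fine-tunes only low-rank adapters on the frozen MVT. Below is a schematic loop in that spirit using the HuggingFace peft API; the backbone interface, target module names, step count, and learning rate are assumptions rather than the authors' settings, and `tco_loss` refers to the illustrative loss sketched after the abstract.

```python
# Hedged sketch of LoRA-based test-time optimization; `mvt`, `views`, `priors`,
# and `tco_loss` are placeholders, not the paper's actual interfaces.
import torch
from peft import LoraConfig, get_peft_model

def run_tco(mvt, views, priors, steps=100, lr=1e-4):
    # Wrap the frozen MVT with low-rank adapters; target_modules is an assumption
    # and depends on the backbone's layer names (e.g. attention projections).
    cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["qkv", "proj"])
    model = get_peft_model(mvt, cfg)  # base weights stay frozen
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr
    )
    for _ in range(steps):
        pred = model(views)            # depth maps, poses, intrinsics, ...
        loss = tco_loss(pred, priors)  # self-supervised term + prior penalties
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model(views)                # refined predictions after adaptation
```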