
Reliev3R: Relieving Feed-forward Reconstruction from Multi-View Geometric Annotations

Youyu Chen, Junjun Jiang, Yueru Luo, Kui Jiang, Xianming Liu, Xu Yan, Dave Zhenyu Chen

Abstract

With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptability to multiple downstream tasks. However, their excessive reliance on multi-view geometric annotations, e.g., 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-intensive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and sparse image correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to supervise multi-view geometric consistency. Trained from scratch on less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervision and scalable FFRMs.
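The abstract names an ambiguity-aware relative depth loss over monocular pseudo depths. As a rough illustration of how a relative (scale- and shift-ambiguous) depth signal can supervise metric-free predictions, the sketch below least-squares aligns each predicted depth map to its pseudo label before measuring error, in the spirit of scale-and-shift-invariant losses; the paper's ambiguity-aware variant is not specified here, so the function name and formulation are assumptions for illustration.

```python
import torch


def ssi_depth_loss(pred_depth: torch.Tensor,
                   pseudo_depth: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """Scale-and-shift-invariant depth loss (illustrative sketch).

    For each image, solve min_{s,b} ||s * pred + b - pseudo||^2 in
    closed form over valid pixels, then take the L1 error of the
    aligned prediction. Shapes: pred_depth, pseudo_depth (B, H, W);
    mask (B, H, W) bool marking valid pseudo-depth pixels.
    """
    losses = []
    for p, t, m in zip(pred_depth, pseudo_depth, mask):
        p, t = p[m], t[m]                                   # (N,), (N,)
        # Design matrix [pred, 1] for the per-image scale/shift fit.
        A = torch.stack([p, torch.ones_like(p)], dim=1)     # (N, 2)
        sol = torch.linalg.lstsq(A, t.unsqueeze(1)).solution  # (2, 1)
        aligned = (A @ sol).squeeze(1)                      # (N,)
        losses.append((aligned - t).abs().mean())
    return torch.stack(losses).mean()
```

Because the alignment absorbs any global scale and offset, the loss only penalizes errors in relative depth structure, which is exactly what zero-shot monocular depth models can reliably provide.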

Paper Structure

This paper contains 29 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: In this paper, we propose Reliev3R, the first learning paradigm to train a Feed-forward Reconstruction Model (FFRM) from scratch without reliance on multi-view geometric annotations. To learn multi-view geometric knowledge, Reliev3R combines pseudo monocular relative depths and pseudo sparse image correspondences with multi-view geometry constraints to form a weakly-supervised learning objective. As shown in the figure, Reliev3R surpasses early FFRMs (e.g., FLARE zhang2025flare) and weakly-supervised camera pose estimation models (e.g., AnyCam wimbauer2025anycam) in overall performance.
  • Figure 2: Similar to prior FFRMs, Reliev3R performs geometry reconstruction in a single forward pass given a group of images. However, the supervision of Reliev3R does not depend on any multi-view geometric annotations (e.g., ground-truth point maps and camera poses registered in 3D world coordinates). Specifically, instead of directly predicting point maps registered in world coordinates, Reliev3R predicts view-wise depth maps, which are regularized with pseudo relative depth. To learn the registration of depth maps and camera poses in world coordinates, Reliev3R draws its supervision signal from pseudo image correspondences. The pseudo annotations used by Reliev3R are produced with pretrained expert models (see the experimental details section). Experiments demonstrate that Reliev3R catches up with, and even surpasses, some FFRMs supervised with multi-view geometric annotations.
  • Figure 3: Point map and camera pose visualization on the DL3DV-benchmark dataset ling2024dl3dv. As shown in the figure, Reliev3R delivers reconstruction accuracy visually comparable to $\pi^3$ wang2025pi while surpassing FLARE zhang2025flare in overall performance. The camera poses of FLARE shown here are taken directly from the model prediction rather than solved with PnP fischler1981random from point maps.
  • Figure 4: Visualization of point maps for zero-shot evaluation on the ScanNet++ dataset dai2017scannet, which has a different focal length from the DL3DV-10K dataset ling2024dl3dv. Reliev3R holds its advantage over AnyCam wimbauer2025anycam even in the zero-shot setting. Furthermore, Reliev3R presents notable zero-shot performance on par with the fully-supervised $\pi^{3\dag}$ trained on the same datasets.
  • Figure 5: Visualization of input (multi-view images), output (multi-view depth maps and confidence maps) and pseudo labels (multi-view correspondences, monocular depth maps) of Reliev3R.
  • ...and 3 more figures
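The Figure 2 caption describes learning the registration of depth maps and camera poses from pseudo image correspondences. One common way to turn correspondences into a geometric loss is to back-project a keypoint with the predicted depth, transform it into the paired view, reproject it, and penalize the pixel distance to the matched keypoint. The sketch below shows this generic correspondence reprojection loss; it is an assumption for illustration, not the paper's exact trigonometry-based formulation, and all names here are hypothetical.

```python
import torch


def reprojection_loss(depth_i: torch.Tensor,
                      K: torch.Tensor,
                      T_ij: torch.Tensor,
                      kp_i: torch.Tensor,
                      kp_j: torch.Tensor) -> torch.Tensor:
    """Generic correspondence reprojection loss (illustrative sketch).

    depth_i: (H, W) predicted depth of view i.
    K:       (3, 3) shared camera intrinsics.
    T_ij:    (4, 4) rigid transform mapping camera-i to camera-j frame.
    kp_i, kp_j: (N, 2) matched pixel coordinates (x, y) in views i, j.
    """
    # Sample predicted depth at the view-i keypoints (nearest pixel).
    x, y = kp_i[:, 0].long(), kp_i[:, 1].long()
    d = depth_i[y, x]                                       # (N,)
    ones = torch.ones_like(d)
    pix = torch.stack([kp_i[:, 0], kp_i[:, 1], ones], 1)    # (N, 3)
    # Back-project to camera-i 3D points, then move to camera j.
    X_i = (torch.linalg.inv(K) @ pix.T) * d                 # (3, N)
    X_j = T_ij[:3, :3] @ X_i + T_ij[:3, 3:4]                # (3, N)
    # Perspective projection into view j's image plane.
    proj = K @ X_j
    proj = proj[:2] / proj[2:].clamp(min=1e-6)              # (2, N)
    # Pixel distance to the pseudo correspondences in view j.
    return (proj.T - kp_j).norm(dim=1).mean()
```

A loss of this form only vanishes when depth and relative pose are consistent with the matches, so it couples the two predictions without ever needing ground-truth point maps or poses.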