Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

Weining Ren, Hongjun Wang, Xiao Tan, Kai Han

TL;DR

Fin3R tackles the gap in fine geometric fidelity and robustness of feed-forward 3D reconstruction by freezing the decoder and performing encoder-only fine-tuning via monocular knowledge distillation from a strong teacher on diverse unlabeled data. A re-normalization LoRA adapter mitigates feature-norm drift, allowing improved single-view depth and multi-view consistency across four baselines (DUSt3R, MASt3R, CUT3R, VGGT) with minimal overhead. The approach combines monocular pseudo-labels, data replay, and uncertainty weighting to yield crisper boundaries and richer geometry, while preserving or slightly improving multi-view performance. This lightweight, general fine-tuning strategy offers practical gains for real-world 3D reconstruction without resorting to per-scene optimization or heavy architectural changes, and demonstrates cross-head benefits through a robust encoder.

Abstract

We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses the pointmaps of all input images into the coordinate system of a reference frame, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R tackles both issues jointly with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: https://visual-ai.github.io/fin3r
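
To make the adapter idea concrete, here is a minimal sketch of a standard LoRA low-rank update on a single frozen linear layer, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class `LoRALinear`, the helper `rescale_to_norm`, and all shapes and values are hypothetical, and the rescaling step is only one simple way to picture keeping feature norms close to those of the frozen encoder; the paper's re-normalization LoRA may be formulated differently.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * (B @ A)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during fine-tuning
        # A gets a small random init and B starts at zero, so the adapted
        # layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def trainable_params(self):
        """Number of adapter parameters: rank * (in_dim + out_dim)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def forward(self, x):
        delta = matmul(self.B, self.A)  # low-rank weight update, out_dim x in_dim
        return [
            sum((w + self.scale * d) * xi for w, d, xi in zip(w_row, d_row, x))
            for w_row, d_row in zip(self.weight, delta)
        ]


def rescale_to_norm(features, target_norm):
    """Hypothetical helper: rescale a feature vector to a reference L2 norm."""
    norm = l2_norm(features)
    if norm == 0.0:
        return features
    return [f * target_norm / norm for f in features]


# Toy usage with made-up numbers.
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(weight, rank=1)
x = [1.0, 2.0, 3.0]

base = layer.forward(x)          # B is zero, so this matches the frozen layer
layer.B = [[0.5], [-0.5]]        # stand-in for a trained adapter
tuned = layer.forward(x)         # adapted features; their norm may drift
renormed = rescale_to_norm(tuned, l2_norm(base))
```

In this toy setup only `A` and `B` would be trained, so the adapter adds `rank * (in_dim + out_dim)` parameters per layer and can be folded into the frozen weight after training, which is consistent with the abstract's claim that test-time memory and latency stay virtually unchanged.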

Paper Structure

This paper contains 45 sections, 22 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Fin3R consistently improves the reconstructed geometry quality in DUSt3R, MASt3R, CUT3R, and VGGT, recovering finer details and producing sharper boundaries.
  • Figure 2: Analysis of scale uncertainty and error metrics. (a) Two views of a red cube are connected by a blue epipolar line. Gaussian distributions overlaid on a foreground point (green) and a background point (yellow) illustrate their respective scale uncertainties, with the foreground exhibiting notably larger epipolar dispersion after projection. (b) Reprojection error and Euclidean distance loss are computed for 10 inputs processed by VGGT (Wang et al., 2025), with 1,000 samples drawn from Hypersim.
  • Figure 3: Heatmaps show spatial variations in $L_2$ norms of encoder patch tokens across configurations. "Avg" is the average norm of the feature map, and (e) Full indicates the full model with re-normalization LoRA and multi-view data replay.
  • Figure 4: Pipeline of our method. Green dashed lines denote pointmap supervision; purple dashed lines indicate distillation supervision.
  • Figure 5: Depth prediction across baseline models. $\bigstar$ indicates integration with our method.
  • ...and 6 more figures