Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation
Weining Ren, Hongjun Wang, Xiao Tan, Kai Han
TL;DR
Fin3R tackles the gap in fine geometric fidelity and robustness of feed-forward 3D reconstruction by freezing the decoder and performing encoder-only fine-tuning via monocular knowledge distillation from a strong teacher on diverse unlabeled data. A re-normalization LoRA adapter mitigates feature-norm drift, allowing improved single-view depth and multi-view consistency across four baselines (DUSt3R, MASt3R, CUT3R, VGGT) with minimal overhead. The approach combines monocular pseudo-labels, data replay, and uncertainty weighting to yield crisper boundaries and richer geometry, while preserving or slightly improving multi-view performance. This lightweight, general fine-tuning strategy offers practical gains for real-world 3D reconstruction without resorting to per-scene optimization or heavy architectural changes, and demonstrates cross-head benefits through a robust encoder.
Abstract
We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses pointmaps for all input images in the coordinate system of a reference frame, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (\textit{i}) the scarcity of high-fidelity depth and pose supervision and (\textit{ii}) the inherent geometric misalignment from multi-view pointmap regression. Fin3R tackles both issues jointly with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: \href{http://visual-ai.github.io/fin3r}{https://visual-ai.github.io/fin3r}
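To make the idea of a LoRA adapter that keeps norms from drifting more concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible form of re-normalization: the low-rank update B·A is added to a frozen weight matrix, and each output row is then rescaled to the norm of the corresponding frozen row. The function names (`matmul`, `row_norms`, `renormalized_lora_weight`) and the row-wise rescaling rule are hypothetical choices made for illustration; the abstract does not specify the adapter's actual formulation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_norms(m):
    """Euclidean norm of each row."""
    return [math.sqrt(sum(v * v for v in row)) for row in m]

def renormalized_lora_weight(w_frozen, lora_b, lora_a, eps=1e-12):
    """Merge a low-rank update into a frozen weight, then restore row norms.

    w_frozen: (out, in) frozen encoder weight (never updated).
    lora_b:   (out, r) trainable factor, typically initialised to zero.
    lora_a:   (r, in) trainable factor.
    ASSUMPTION: row-wise rescaling is an illustrative guess at what
    "re-normalization" could mean; it is not taken from the paper.
    """
    delta = matmul(lora_b, lora_a)
    merged = [[w + d for w, d in zip(w_row, d_row)]
              for w_row, d_row in zip(w_frozen, delta)]
    target = row_norms(w_frozen)
    current = row_norms(merged)
    # Rescale each merged row so its norm matches the frozen row's norm.
    return [[v * t / max(c, eps) for v in row]
            for row, t, c in zip(merged, target, current)]

# Toy example: a 2x2 frozen weight with a rank-1 update.
w = [[3.0, 4.0], [0.0, 2.0]]
b = [[1.0], [0.0]]
a = [[1.0, 0.0]]
w_adapted = renormalized_lora_weight(w, b, a)
```

In this toy setup only `lora_a` and `lora_b` would be trained. With `lora_b` at its zero initialisation the adapted weight equals the frozen one, so fine-tuning starts from the pretrained encoder's behaviour.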
