R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors
Haoyang Wang, Liming Liu, Peiheng Wang, Junlin Hao, Jiangkai Wu, Xinggong Zhang
TL;DR
R-Meshfusion addresses sparse-view mesh reconstruction by leveraging diffusion priors through a Consensus Diffusion Module and an online UCB-based viewpoint selector, integrated into a two-stage NeRF-based optimization. It introduces diffusion-guided knowledge distillation, a robust consensus mechanism with IQR filtering and variance-aware fusion, and an adaptive view selection policy to maximize informative supervision. The method jointly optimizes geometry and appearance via a NeRF-SDF pipeline with diffusion losses, achieving notable improvements in geometric accuracy and rendering quality on real sparse-view data. This framework narrows the gap between diffusion priors and reliable 3D reconstruction, enabling more robust multi-view mesh generation in constrained capture scenarios.
Abstract
Mesh reconstruction from multi-view images is a fundamental problem in computer vision, but its performance degrades significantly under sparse-view conditions, especially in unseen regions where no ground-truth observations are available. Recent diffusion models have demonstrated strong capabilities in synthesizing novel views from limited inputs, but their outputs often suffer from visual artifacts and lack 3D consistency, which makes them unreliable as supervision for mesh optimization. In this paper, we propose a novel framework that leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner. To address the instability of diffusion outputs, we introduce a Consensus Diffusion Module that filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion to produce robust pseudo-supervision. Building on this, we design an online reinforcement learning strategy based on the Upper Confidence Bound (UCB) that adaptively selects the most informative viewpoints for enhancement, guided by the diffusion loss. Finally, the fused images are used together with the sparse-view ground truth to jointly supervise a NeRF-based model, ensuring consistency across both geometry and appearance. Extensive experiments demonstrate that our method achieves significant improvements in both geometric and rendering quality.
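The two mechanisms named in the abstract, IQR-based filtering with variance-aware fusion and UCB-based viewpoint selection, can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: images are flat lists of floats in [0, 1], each candidate is scored by its mean absolute deviation from the pixel-wise median, fusion uses Gaussian weights on per-pixel z-scores, and the selector follows the standard UCB1 rule with a reward that is assumed to be derived from the diffusion loss at the chosen view. The names `iqr_filter`, `consensus_fuse`, and `UCBViewSelector` are hypothetical.

```python
import math
import statistics


def iqr_filter(scores, k=1.5):
    """Return indices of scores inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, s in enumerate(scores) if lo <= s <= hi]


def consensus_fuse(candidates, eps=1e-6):
    """Fuse candidate images (flat lists of floats) into one image."""
    # Pixel-wise median serves as a rough consensus reference.
    reference = [statistics.median(px) for px in zip(*candidates)]
    # Assumed outlier score: mean absolute deviation from the reference.
    scores = [
        sum(abs(a - b) for a, b in zip(img, reference)) / len(img)
        for img in candidates
    ]
    kept = [candidates[i] for i in iqr_filter(scores)]
    fused = []
    for px in zip(*kept):
        mean = statistics.fmean(px)
        var = statistics.pvariance(px, mean)
        # Assumed weighting: values many standard deviations away from the
        # per-pixel mean contribute less to the fused pixel.
        weights = [math.exp(-((v - mean) ** 2) / (2 * var + eps)) for v in px]
        fused.append(sum(w * v for w, v in zip(weights, px)) / sum(weights))
    return fused


class UCBViewSelector:
    """Standard UCB1 bandit over a fixed set of candidate viewpoints."""

    def __init__(self, num_views, c=1.0):
        self.c = c
        self.counts = [0] * num_views
        self.means = [0.0] * num_views

    def select(self):
        # Try every view once before trusting the running estimates.
        for view, n in enumerate(self.counts):
            if n == 0:
                return view
        total = sum(self.counts)
        ucb = [
            m + self.c * math.sqrt(2 * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, view, reward):
        # Incremental running mean of the reward observed at this view.
        self.counts[view] += 1
        self.means[view] += (reward - self.means[view]) / self.counts[view]
```

In a training loop under these assumptions, one would call `select()` to pick a viewpoint, generate several diffusion candidates for it, pass them to `consensus_fuse`, and feed a reward computed from the diffusion loss back through `update()`. A real pipeline would operate on image tensors rather than Python lists.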
