R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors

Haoyang Wang, Liming Liu, Peiheng Wang, Junlin Hao, Jiangkai Wu, Xinggong Zhang

TL;DR

R-Meshfusion addresses sparse-view mesh reconstruction by leveraging diffusion priors through a Consensus Diffusion Module and an online UCB-based viewpoint selector, integrated into a two-stage NeRF-based optimization. It introduces diffusion-guided knowledge distillation, a robust consensus mechanism with IQR filtering and variance-aware fusion, and an adaptive view selection policy to maximize informative supervision. The method jointly optimizes geometry and appearance via a NeRF-SDF pipeline with diffusion losses, achieving notable improvements in geometric accuracy and rendering quality on real sparse-view data. This framework narrows the gap between diffusion priors and reliable 3D reconstruction, enabling more robust multi-view mesh generation in constrained capture scenarios.

Abstract

Mesh reconstruction from multi-view images is a fundamental problem in computer vision, but its performance degrades significantly under sparse-view conditions, especially in unseen regions where no ground-truth observations are available. While recent advances in diffusion models have demonstrated strong capabilities in synthesizing novel views from limited inputs, their outputs often suffer from visual artifacts and lack 3D consistency, posing challenges for reliable mesh optimization. In this paper, we propose a novel framework that leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner. To address the instability of diffusion outputs, we propose a Consensus Diffusion Module that filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion to produce robust pseudo-supervision. Building on this, we design an online reinforcement learning strategy based on the Upper Confidence Bound (UCB) to adaptively select the most informative viewpoints for enhancement, guided by diffusion loss. Finally, the fused images are used to jointly supervise a NeRF-based model alongside sparse-view ground truth, ensuring consistency across both geometry and appearance. Extensive experiments demonstrate that our method achieves significant improvements in both geometric quality and rendering quality.
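
As a rough illustration of the Consensus Diffusion Module described in the abstract, the following Python sketch filters candidate generations with an IQR rule and then fuses the survivors pixel by pixel. It is a minimal sketch under stated assumptions, not the authors' implementation: images are represented as flat lists of floats, and the outlier score (mean absolute distance to the pixel-wise median), the 1.5 multiplier, the variance threshold, and the mean/median fallback are all illustrative choices. The names `iqr_filter` and `variance_aware_fuse` are hypothetical.

```python
import statistics


def iqr_filter(images, k=1.5):
    """Drop candidate images whose outlier score falls outside the IQR fences.

    Assumption: the score is the mean absolute distance to the pixel-wise
    median image; the source does not specify the statistic being filtered.
    """
    n_pixels = len(images[0])
    median_img = [statistics.median([img[p] for img in images])
                  for p in range(n_pixels)]
    scores = [sum(abs(a - b) for a, b in zip(img, median_img)) / n_pixels
              for img in images]
    q1, _, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [img for img, s in zip(images, scores) if lo <= s <= hi]


def variance_aware_fuse(images, var_threshold=0.01):
    """Fuse images per pixel: mean where candidates agree, median otherwise.

    Assumption: this mean/median switch is one plausible reading of
    "variance-aware fusion", chosen only for illustration.
    """
    fused = []
    for p in range(len(images[0])):
        values = [img[p] for img in images]
        if statistics.pvariance(values) <= var_threshold:
            fused.append(statistics.fmean(values))   # low variance: average
        else:
            fused.append(statistics.median(values))  # high variance: robust pick
    return fused


# Seven mutually consistent two-pixel candidates plus one inconsistent one.
candidates = [[0.50 + 0.01 * i, 0.50 - 0.01 * i] for i in range(7)] + [[1.0, 0.0]]
kept = iqr_filter(candidates)
fused = variance_aware_fuse(kept)
print(len(kept), fused)
```

In this toy run the inconsistent candidate is rejected before fusion, so it cannot pull the fused pseudo-supervision away from the consensus. Note that the IQR rule needs enough candidates for the quartiles to be meaningful; with very few generations a single outlier can inflate the fences and slip through.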

Paper Structure

This paper contains 29 sections, 26 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Our framework, R-Meshfusion. We adopt a two-stage, coarse-to-fine training process for mesh reconstruction. In stage 1, we initialize the geometry and view-dependent appearance representation based on NeRF. This initial phase results in a coarse SDF grid and an initial view-dependent appearance model. Then, in stage 2, we leverage a pretrained conditioned diffusion model and apply knowledge distillation for further mesh refinement. Each training iteration takes three steps. We first select the optimal viewpoint using the UCB strategy and render the corresponding image from the mesh. We then apply our proposed Consensus Diffusion Module to obtain a higher-quality diffusion-guided (DG) image. Finally, we simultaneously refine both the geometry and the appearance representation. After training is complete, we obtain the final mesh.
  • Figure 2: Illustration of our Consensus Diffusion Module.
  • Figure 3: Illustration of a mesh-rendered image on the left, followed by diffusion-generated images on the right.
  • Figure 4: Qualitative comparison of mesh rendering quality and geometry between our method and the baselines. Under sparse-view inputs, our method achieves superior visual fidelity and more accurate geometry.
  • Figure 5: Rendering quality with (right) and without (middle) image fusion (IF). IF critically preserves fine details and reduces blurred artifacts.
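
The first of the three per-iteration steps in the Figure 1 caption, UCB-based viewpoint selection, can be sketched as a standard UCB1 bandit over a fixed set of candidate viewpoints. This is a minimal sketch under stated assumptions rather than the paper's exact policy: the class name `UCBViewSelector`, the exploration constant `c`, and the choice to treat a higher diffusion loss as a higher reward are all hypothetical, and the per-view losses in the usage loop are made-up constants standing in for a real diffusion loss.

```python
import math


class UCBViewSelector:
    """UCB1 bandit: each arm is one candidate viewpoint."""

    def __init__(self, num_views, c=1.0):
        self.counts = [0] * num_views         # times each view was selected
        self.mean_reward = [0.0] * num_views  # running mean reward per view
        self.c = c                            # exploration weight (assumed)
        self.t = 0                            # total number of updates

    def select(self):
        # Visit every view once before trusting the confidence bounds.
        for view, n in enumerate(self.counts):
            if n == 0:
                return view
        scores = [
            mean + self.c * math.sqrt(2.0 * math.log(self.t) / n)
            for mean, n in zip(self.mean_reward, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, view, reward):
        self.t += 1
        self.counts[view] += 1
        # Incremental update of the running mean.
        self.mean_reward[view] += (reward - self.mean_reward[view]) / self.counts[view]


# Toy usage. Assumption: reward = diffusion loss observed at the chosen view,
# so views where the current mesh disagrees most with the prior are favored.
toy_losses = [0.1, 0.9, 0.3]  # made-up stand-ins for per-view diffusion loss
selector = UCBViewSelector(num_views=len(toy_losses))
for _ in range(300):
    view = selector.select()
    selector.update(view, toy_losses[view])
print(selector.counts)
```

With these constant toy rewards the selector concentrates on the highest-loss view while still revisiting the others occasionally, which is the exploration and exploitation trade-off that motivates using UCB for view selection.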