3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement

Yihang Luo, Shangchen Zhou, Yushi Lan, Xingang Pan, Chen Change Loy

TL;DR

3DEnhancer tackles the challenge of low-resolution, view-inconsistent 3D content by introducing a multi-view latent diffusion framework augmented with a pose-aware encoder, view-consistent DiT blocks, and epipolar-guided cross-view mechanisms. The approach leverages a 2D diffusion prior to refine coarse multi-view renders while enforcing cross-view coherence through multi-view row attention and near-view epipolar aggregation, aided by an extensive multi-view data augmentation pipeline. It supports enhancing outputs from existing multi-view diffusion models and directly refining coarse 3D representations such as 3D Gaussians or other reconstructions, yielding finer texture detail and better consistency across views, as demonstrated qualitatively and quantitatively on synthetic Objaverse data and in-the-wild objects. Ablation and user studies confirm the effectiveness of the cross-view modules and augmentations, underscoring the method's potential for robust 3D texture refinement, editing, and reconstruction in practical pipelines.

Abstract

Despite advances in neural rendering, the scarcity of high-quality 3D datasets and the inherent limitations of multi-view diffusion models restrict view synthesis and 3D model generation to low resolutions with suboptimal multi-view consistency. In this study, we present a novel 3D enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency. Our method includes a pose-aware encoder and a diffusion-based denoiser to refine low-quality multi-view images, along with data augmentation and a multi-view attention module with epipolar aggregation to maintain consistent, high-quality 3D outputs across views. Unlike existing video-based approaches, our model supports seamless multi-view enhancement with improved coherence across diverse viewing angles. Extensive evaluations show that 3DEnhancer significantly outperforms existing methods, boosting both multi-view enhancement and per-instance 3D optimization tasks.
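
To make the multi-view row attention named in the TL;DR more concrete, here is a minimal pure-Python sketch. It rests on assumptions that this page does not confirm: that tokens lying in the same image row attend to each other across all views, and that queries, keys, and values are the raw features with no learned projections or attention heads. The function name `multi_view_row_attention` and the nested-list layout `feats[v][h][w]` are hypothetical and chosen only for illustration. The sketch does not cover epipolar aggregation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_view_row_attention(feats):
    """Self-attention restricted to tokens that share an image row.

    feats[v][h][w] is a length-C feature vector for view v, row h, column w.
    Assumption (not from the source): Q = K = V = the raw features.
    Returns a nested list with the same [V][H][W][C] layout.
    """
    V, H, W = len(feats), len(feats[0]), len(feats[0][0])
    C = len(feats[0][0][0])
    out = [[[None] * W for _ in range(H)] for _ in range(V)]
    for h in range(H):
        # Gather row h from every view: V * W tokens form one attention group.
        tokens = [(v, w, feats[v][h][w]) for v in range(V) for w in range(W)]
        for v_q, w_q, q in tokens:
            # Scaled dot-product scores against every token in the group.
            scores = [
                sum(a * b for a, b in zip(q, k)) / math.sqrt(C)
                for _, _, k in tokens
            ]
            weights = softmax(scores)
            # Weighted average of the group's features.
            out[v_q][h][w_q] = [
                sum(wt * k[c] for wt, (_, _, k) in zip(weights, tokens))
                for c in range(C)
            ]
    return out
```

Under these assumptions, each output token mixes information from the same row in every view, so a change to one view's row affects that row in all other views but leaves other rows untouched. Compared with full attention over all V * H * W tokens, each attention group holds only V * W tokens.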

Paper Structure

This paper contains 29 sections, 5 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: Our proposed 3DEnhancer showcases excellent capabilities in enhancing multi-view images generated by various models. As shown in (a), it can significantly improve texture details, correct texture errors, and enhance consistency across views. Beyond enhancement, as illustrated in (b), 3DEnhancer also supports texture-level editing, including regional inpainting, and adjusting texture enhancement strength via noise level control. (Zoom-in for best view)
  • Figure 2: An overview of 3DEnhancer. By harnessing generative priors, 3DEnhancer adapts a text-to-image diffusion model to a multi-view framework aimed at 3D enhancement. It is compatible with multi-view images generated by models like MVDream [shi2023MVDream] or those rendered from coarse 3D representations like NeRFs [mildenhall2020nerf] and 3DGS [kerbl3Dgaussians]. Given low-quality (LQ) multi-view images along with their corresponding camera poses, 3DEnhancer aggregates multi-view information within a DiT [Peebles2022DiT] framework using row attention and epipolar aggregation modules, improving visual quality while preserving consistency across views. Furthermore, the model supports texture-level editing via text prompts and adjustable noise levels, allowing users to correct texture errors and control the enhancement strength.
  • Figure 3: Qualitative comparisons of enhancing multi-view synthesis on the Objaverse synthetic dataset. As can be seen, only 3DEnhancer can correct flawed and missing textures with view consistency.
  • Figure 4: Qualitative comparisons of enhancing multi-view synthesis with RealBasicVSR [chan2022realbasicvsr] and Upscale-A-Video [zhou2024upscale] on the in-the-wild dataset. On visual inspection, 3DEnhancer yields sharp and consistent textures with intact semantics, such as the eyes of the girl.
  • Figure 5: Qualitative comparisons of enhancing 3D reconstruction given generated multi-view images on the in-the-wild dataset. Multi-view models produce low-quality, view-inconsistent outputs, leading to flawed 3D reconstructions. Existing methods fail to correct texture artifacts, while our method produces both geometrically accurate and visually appealing results.
  • ...and 13 more figures