Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion

Fan Yang, Jianfeng Zhang, Yichun Shi, Bowen Chen, Chenxu Zhang, Huichao Zhang, Xiaofeng Yang, Xiu Li, Jiashi Feng, Guosheng Lin

TL;DR

Magic-Boost tackles instability and low detail in 3D generation by introducing a multi-view conditioned diffusion model that leverages pseudo multi-view priors to guide a fast SDS refinement. A time-fixed local feature extractor, cross-view 3D attention, and data augmentation enable robust extraction of 3D priors from inconsistent views, while an Anchor Iterative Update loss stabilizes the refinement. The pipeline converts coarse 3D inputs (e.g., Instant3D) into differentiable representations (via fast NeRF) and optimizes with SDS over a short horizon, achieving high-fidelity geometry and textures in around 15 minutes. Empirical results on image-to-3D generation and novel view synthesis demonstrate improved quality, stronger identity preservation, and faster inference, with the method being plug-in compatible with various pseudo multi-view priors and backbones.

Abstract

Benefiting from the rapid development of 2D diffusion models, 3D content generation has witnessed significant progress. One promising solution is to finetune pre-trained 2D diffusion models to produce multi-view images and then reconstruct them into 3D assets via feed-forward sparse-view reconstruction models. However, limited by the 3D inconsistency of the generated multi-view images and the low reconstruction resolution of the feed-forward reconstruction models, the generated 3D assets still suffer from incorrect geometry and blurry textures. To address this problem, we present a multi-view based refinement method, named Magic-Boost, to further refine the generation results. In detail, we first propose a novel multi-view conditioned diffusion model, which extracts 3D priors from the synthesized multi-view images to synthesize high-fidelity novel-view images. We then introduce a novel iterative-update strategy that uses this model to provide precise guidance for refining the coarse generated results through a fast optimization process. Conditioned on the strong 3D priors extracted from the synthesized multi-view images, Magic-Boost provides precise optimization guidance that aligns well with the coarse generated 3D assets, enriching local detail in both geometry and texture within a short time ($\sim 15$ min). Extensive experiments show that Magic-Boost greatly enhances the coarse generated inputs, generating high-quality 3D assets with rich geometric and textural details. (Project Page: https://magic-research.github.io/magic-boost/)

Paper Structure (16 sections, 1 equation, 11 figures, 2 tables)

This paper contains 16 sections, 1 equation, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Provided with an input image and its coarse 3D generation, Magic-Boost effectively boosts it to a high-quality 3D asset within 15 minutes. From left to right, we show the input image, the pseudo multi-view images and coarse 3D results from Instant3D [li2023instant3d], and the significantly improved results produced by our method.
  • Figure 2: The overall pipeline. The proposed Magic-Boost can serve as a plug-in module for any 3D generation method capable of providing pseudo multi-view priors, such as Instant3D [li2023instant3d], InstantMesh [xu2024instantmesh], and LGM [tang2024lgm]. Benefiting from the strong 3D priors provided by the pseudo multi-view images, Magic-Boost provides precise SDS guidance, significantly enhancing the coarse 3D outputs within a brief interval ($\sim 15$ min).
  • Figure 3: Architecture of our multi-view conditioned diffusion model. At the core of our model lies the extraction of dense local features by a denoising U-Net operating at a fixed timestep. Concurrently, we use a frozen CLIP ViT encoder to distill high-level signals. The original 2D self-attention layer is extended into 3D by concatenating keys and values across the different views. To control the condition strength of each conditional view, we introduce a control label that lets users set this strength manually.
  • Figure 4: Illustration of the anchor iterative update loss. We treat the input pseudo multi-view images as the initial anchor dataset and update it iteratively: we render an anchor-view image, perturb it with random noise, and then apply a multi-step denoising process with the proposed multi-view conditioned diffusion model to refine it. The refined anchor images are then used to supervise the generation with an MSE loss, eliminating the over-saturation problem of the SDS optimization process [poole2022Dreamfusion].
  • Figure 5: Qualitative comparison of our method with ImageDream [wang2023imagedream] and the base sparse-view reconstruction model [li2023instant3d]. SVR denotes Sparse-View Reconstruction.
  • ...and 6 more figures
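
The Figure 3 caption describes extending 2D self-attention into 3D by concatenating keys and values across views. The following is a minimal NumPy sketch of that idea only, not the authors' implementation: the function name `cross_view_attention`, the single-head layout, and the tensor shapes are assumptions made for illustration, and the learned projections, the control label, and the U-Net itself are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_view_attention(q, k, v):
    """Single-head attention where each view attends to tokens of all views.

    q, k, v: arrays of shape (views, tokens, dim).
    Returns an array of shape (views, tokens, dim).
    Hypothetical helper for illustration; shapes and names are assumptions.
    """
    n_views, n_tokens, dim = k.shape
    # Concatenate keys and values across views into one shared token set.
    k_all = k.reshape(n_views * n_tokens, dim)
    v_all = v.reshape(n_views * n_tokens, dim)
    # Scaled dot-product scores: (views, tokens, views * tokens).
    scores = q @ k_all.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    # Weighted sum over the tokens of every view.
    return weights @ v_all


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 6, 8))  # 4 views, 6 tokens per view, 8 channels
k = rng.normal(size=(4, 6, 8))
v = rng.normal(size=(4, 6, 8))
out = cross_view_attention(q, k, v)
print(out.shape)
```

With a single view this reduces to ordinary self-attention; with several views, every query can draw on features from all views, which is what allows information to be shared across them.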