Distilling Multi-view Diffusion Models into 3D Generators

Hao Qin, Luyuan Chen, Ming Kong, Mengxu Lu, Qiang Zhu

TL;DR

DD3G generates 3D content from a single image by distilling a pre-trained multi-view diffusion model into a fast 3D Gaussian generator. Its core component is PEPD, a two-phase generator (Pattern Extraction and Progressive Decoding) that lifts a single image into a 3D Gaussian representation. The teacher's probabilistic flow is preserved by following a deterministic DDIM trajectory, and training uses a joint objective that combines explicit supervision with implicit verification. A dataset of 120k RGBA images supports the distillation. Inference takes about 0.06 seconds, and the method generalizes across synthetic images and real-world photographs. The authors report better geometry and view consistency than the baselines, and the approach points toward scalable, texture-rich 3D generation without requiring 3D data during distillation.

Abstract

We introduce DD3G, a formulation that Distills a multi-view Diffusion model (MV-DM) into a 3D Generator using Gaussian splatting. DD3G compresses and integrates extensive visual and spatial geometric knowledge from the MV-DM by simulating its ordinary differential equation (ODE) trajectory, ensuring the distilled generator generalizes better than those trained solely on 3D data. Unlike previous amortized optimization approaches, we align the MV-DM and 3D generator representation spaces to transfer the teacher's probabilistic flow to the student, thus avoiding inconsistencies in optimization objectives caused by probabilistic sampling. The introduction of probabilistic flow and the coupling of various attributes in 3D Gaussians introduce challenges in the generation process. To tackle this, we propose PEPD, a generator consisting of Pattern Extraction and Progressive Decoding phases, which enables efficient fusion of probabilistic flow and converts a single image into 3D Gaussians within 0.06 seconds. Furthermore, to reduce knowledge loss and overcome sparse-view supervision, we design a joint optimization objective that ensures the quality of generated samples through explicit supervision and implicit verification. Leveraging existing 2D generation models, we compile 120k high-quality RGBA images for distillation. Experiments on synthetic and public datasets demonstrate the effectiveness of our method. Our project is available at: https://qinbaigao.github.io/DD3G_project/
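To make the idea of a deterministic ODE trajectory concrete, the following is a minimal sketch of the standard deterministic DDIM update (eta = 0), in which a fixed starting noise always maps to the same output. It is not the authors' code: `toy_eps_model` is a hypothetical stand-in for the teacher's noise predictor, the `alpha_bars` schedule is made up, and plain Python lists are assumed in place of image tensors.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of floats."""
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_1m_ab_t = math.sqrt(1.0 - alpha_bar_t)
    sqrt_ab_prev = math.sqrt(alpha_bar_prev)
    sqrt_1m_ab_prev = math.sqrt(1.0 - alpha_bar_prev)
    x_prev = []
    for x, e in zip(x_t, eps_pred):
        # Predict the clean sample from the current noisy sample.
        x0 = (x - sqrt_1m_ab_t * e) / sqrt_ab_t
        # Re-noise to the next (less noisy) level with the SAME eps: no fresh randomness.
        x_prev.append(sqrt_ab_prev * x0 + sqrt_1m_ab_prev * e)
    return x_prev


def toy_eps_model(x_t, alpha_bar_t, target):
    """Hypothetical noise predictor: exact when all data sits at `target`."""
    scale = math.sqrt(alpha_bar_t)
    denom = math.sqrt(1.0 - alpha_bar_t)
    return [(x - scale * m) / denom for x, m in zip(x_t, target)]


def ddim_sample(noise, alpha_bars, target):
    """Follow the deterministic trajectory from `noise` to a clean sample.

    `alpha_bars` runs from noisiest to cleanest and is assumed to end at 1.0.
    """
    x = list(noise)
    for ab_t, ab_prev in zip(alpha_bars[:-1], alpha_bars[1:]):
        eps = toy_eps_model(x, ab_t, target)
        x = ddim_step(x, eps, ab_t, ab_prev)
    return x
```

Because no random numbers are drawn inside the loop, calling `ddim_sample` twice with the same `noise` returns identical results. This is the property that lets a fixed (noise, input) pair be matched with a fixed teacher output, so a student can be supervised on that mapping.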

Paper Structure

This paper contains 20 sections, 8 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: DD3G can distill visual knowledge from MV-DM into the 3D Gaussian generator to achieve rapid and generalized high-quality 3D generation.
  • Figure 2: DD3G trains the 3D Gaussian generator PEPD to lift a single image into a 3D object. The training samples are offline-collected quadruples {N, C, II, OI} (noise, camera pose, input image, output multi-view images). The Pattern Extraction (PE) phase extracts the lifting pattern of II from the random information N and C; this pattern serves as general guidance for the Progressive Decoding (PD) phase, which progressively decouples the 3D Gaussian attributes. A joint optimization objective that combines explicit supervision with implicit verification improves the quality of the generated samples.
  • Figure 3: Overview of the synthetic image collection process. We adopt the same method as in zou2024triplane to extract the foreground objects in images.
  • Figure 4: Qualitative comparisons between DD3G and other baselines. Our method achieves significant advantages in overall geometric consistency, primarily due to the efficient utilization of visual knowledge from MV-DM.
  • Figure 5: Illustration of CLIP Similarity and User Study. The samples generated by PEPD are more aligned with human preferences.
  • ...and 5 more figures
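The {N, C, II, OI} quadruples named in the Figure 2 caption can be pictured as a simple record type. The sketch below is illustrative only: the class name `DistillSample`, the field names, the flat-list layouts, and the one-camera-pose-per-output-view check are all assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DistillSample:
    """One offline-collected training sample (hypothetical layout)."""

    noise: List[float]               # N: initial noise fed to the teacher
    camera_poses: List[List[float]]  # C: one pose vector per target view (assumed)
    input_image: List[float]         # II: flattened RGBA input image
    output_views: List[List[float]]  # OI: teacher's multi-view outputs, flattened

    def __post_init__(self):
        # Assumption: each output view is rendered from exactly one camera pose.
        if len(self.camera_poses) != len(self.output_views):
            raise ValueError("expected one camera pose per output view")


def num_views(sample: DistillSample) -> int:
    """Number of supervised views in a sample."""
    return len(sample.output_views)
```

Freezing the dataclass reflects that the quadruples are collected once offline and then only read during training.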