ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

Zhiyuan Ma, Yuxiang Wei, Yabin Zhang, Xiangyu Zhu, Zhen Lei, Lei Zhang

TL;DR

Asynchronous Score Distillation (ASD) reduces the noise prediction error by shifting the diffusion timestep to earlier ones instead of changing the weights of the pre-trained diffusion model, thus preserving the model's strong prompt comprehension.

Abstract

By leveraging text-to-image diffusion priors, score distillation can synthesize 3D content without paired text-3D training data. Instead of spending hours of online optimization per text prompt, recent studies have focused on learning a text-to-3D generative network that amortizes multiple text-3D relations and can synthesize 3D content in seconds. However, existing score distillation methods are hard to scale up to a large number of text prompts because it is difficult to align the pretrained diffusion prior with the distribution of images rendered from various text prompts. State-of-the-art methods such as Variational Score Distillation fine-tune the pretrained diffusion model to minimize the noise prediction error and thereby align the distributions; this fine-tuning, however, is unstable to train and impairs the model's ability to comprehend numerous text prompts. Based on the observation that diffusion models tend to have lower noise prediction errors at earlier timesteps, we propose Asynchronous Score Distillation (ASD), which minimizes the noise prediction error by shifting the diffusion timestep to earlier ones. ASD is stable to train and can scale up to 100k prompts. It reduces the noise prediction error without changing the weights of the pre-trained diffusion model, thus keeping its strong prompt comprehension. We conduct extensive experiments across different 2D diffusion models, including Stable Diffusion and MVDream, and text-to-3D generators, including Hyper-iNGP, 3DConv-Net and Triplane-Transformer. The results demonstrate ASD's effectiveness in stable 3D generator training and high-quality 3D content synthesis, as well as its superior prompt consistency, especially with a large prompt corpus.
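The timestep-shifting idea can be illustrated with a small NumPy sketch. This is only one plausible reading of the abstract, not the authors' implementation: it assumes the update direction is the difference between a frozen model's noise predictions at timestep $t$ and at a shifted timestep $t+\Delta t$ (the notation used in the Figure 2 caption), and that the same noise sample is used for both. The names `add_noise`, `toy_eps_model` and `asd_direction`, as well as the linear `alpha_bars` schedule, are hypothetical; `toy_eps_model` is a stand-in for a real pretrained diffusion model.

```python
import numpy as np


def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def toy_eps_model(x_t, t, alpha_bars):
    """Hypothetical stand-in for a frozen, pretrained noise predictor.

    Assumption: clean data is N(0, I), for which the optimal noise
    prediction is E[eps | x_t] = sqrt(1 - abar_t) * x_t.
    """
    return np.sqrt(1.0 - alpha_bars[t]) * x_t


def asd_direction(x0, t, delta_t, alpha_bars, eps_model, rng):
    """Difference of noise predictions at asynchronous timesteps.

    Assumption: the same noise sample is shared by both timesteps and the
    shifted timestep is t + delta_t, clamped to the last valid index.
    The model weights are never updated; only the timestep changes.
    """
    eps = rng.standard_normal(x0.shape)
    t_shift = min(t + delta_t, len(alpha_bars) - 1)
    x_t = add_noise(x0, eps, alpha_bars[t])
    x_shift = add_noise(x0, eps, alpha_bars[t_shift])
    return eps_model(x_t, t, alpha_bars) - eps_model(x_shift, t_shift, alpha_bars)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bars = np.linspace(0.999, 0.01, 1000)  # toy noise schedule (assumption)
    rendered = rng.standard_normal((4, 4))       # stands in for a rendered image
    direction = asd_direction(rendered, 500, 100, alpha_bars, toy_eps_model, rng)
    print(direction.shape)
```

In a real pipeline the returned direction would be backpropagated through the renderer to the 3D representation or generator parameters; that step is omitted here. With `delta_t = 0` the two predictions coincide and the direction vanishes, which is why the shift is what supplies the learning signal in this sketch.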

Paper Structure

This paper contains 20 sections, 4 equations, 14 figures, 5 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Top two rows: Asynchronous Score Distillation (ASD) for prompt-specific text-to-3D generation. Bottom row: ASD for prompt-amortized generation, which learns a text-to-3D generator on multiple prompts without 3D ground truths. ASD can scale the training corpus up to as many as 100k text prompts.
  • Figure 2: Illustration of the noise prediction error of the pre-trained 2D diffusion model $\boldsymbol{\epsilon}_{PT}(t)$ and that of the fine-tuned 2D diffusion model $\boldsymbol{\epsilon}_{FT}(t)$. We can see that the curve of $e_{FT}(t)$ is positioned under that of $e_{PT}(t)$, and we can shift the timestep of $\boldsymbol{\epsilon}_{PT}(t)$ to $\boldsymbol{\epsilon}_{PT}(t+\Delta t)$ to approximate the noise prediction error of $\boldsymbol{\epsilon}_{FT}(t)$.
  • Figure 3: Left and middle: 2D toy examples by SDS [poole2022dreamfusion], CSD [yu2023text], VSD [wang2023prolificdreamer] and our proposed ASD. Right: Gradient norms generated by different methods.
  • Figure 4: Overview of Asynchronous Score Distillation (ASD). As illustrated in the left sub-figure, ASD can be employed for prompt-specific generation by optimizing 3D representations for each prompt, as well as for prompt-amortized generation by training a text-to-3D generator. The right sub-figure depicts how ASD uses the difference in noise predictions at asynchronous timesteps to update the 3D network parameters.
  • Figure 5: Qualitative comparison of prompt-specific (with iNGP as the 3D representation) and prompt-amortized (with Hyper-iNGP as the 3D generator) text-to-3D results by SDS [poole2022dreamfusion], CSD [yu2023text], VSD [wang2023prolificdreamer], ATT3D [lorraine2023att3d] and our ASD.
  • ...and 9 more figures