
Dream-in-Style: Text-to-3D Generation Using Stylized Score Distillation

Hubert Kompanowski, Binh-Son Hua

TL;DR

This work tackles stylized text-to-3D generation by jointly optimizing geometry and appearance to match a text prompt and a style reference image. It introduces stylized score distillation, which blends scores from a pretrained text-to-image diffusion model and a modified variant with swapped attention features to inject style, enabling a single-stage optimization of neural radiance fields. The approach supports multiple SDS-based loss families (SDS, NFSD, VSD) through a unified framework and demonstrates strong qualitative and quantitative performance, including a GPT-based user evaluation. Limitations include the Janus problem with single-view diffusion priors, mitigated by multi-view diffusion models in follow-up work, and challenges in hard or ambiguous styling cases. The method is computationally efficient on current GPUs and offers a flexible and robust pathway for stylized 3D content creation from text and style references.

Abstract

We present a method to generate 3D objects in a given style. Our method takes a text prompt and a style reference image as input and reconstructs a neural radiance field to synthesize a 3D model whose content aligns with the text prompt and whose style follows the reference image. To generate the 3D object and perform style transfer in one go, we propose a stylized score distillation loss that guides a text-to-3D optimization process toward visually plausible geometry and appearance. Our stylized score distillation combines an original pretrained text-to-image model with a modified sibling whose self-attention key and value features are manipulated to inject the style of the reference image. Comparisons with state-of-the-art methods demonstrate the strong visual performance of our method, further supported by quantitative results from our user study.
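The core idea of combining scores from the original model and its style-injected sibling can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the linear blend with a style-strength weight `gamma`, the function name `stylized_sds_grad`, and the use of random arrays in place of real U-Net noise predictions are all assumptions made for the sketch.

```python
import numpy as np

def stylized_sds_grad(eps_text, eps_style, noise, w_t, gamma=0.5):
    """Illustrative stylized score-distillation gradient.

    eps_text  : noise prediction of the original pretrained
                text-to-image model (content/text guidance).
    eps_style : noise prediction of the modified sibling whose
                self-attention key/value features inject the
                reference style.
    noise     : the Gaussian noise added to the rendered view.
    w_t       : timestep-dependent weighting, as in SDS.
    gamma     : assumed style-strength blend weight in [0, 1];
                the exact combination rule here is a sketch.
    """
    eps_blend = (1.0 - gamma) * eps_text + gamma * eps_style
    return w_t * (eps_blend - noise)

# Toy example with random stand-ins for the two noise predictions.
rng = np.random.default_rng(0)
shape = (4, 64, 64)
eps_text = rng.standard_normal(shape)
eps_style = rng.standard_normal(shape)
noise = rng.standard_normal(shape)
grad = stylized_sds_grad(eps_text, eps_style, noise, w_t=1.0, gamma=0.3)
```

With `gamma = 0` this reduces to a plain SDS-style gradient from the original model; increasing `gamma` shifts the optimization toward the style-injected scores, which matches the paper's single-stage joint content-and-style optimization at a high level.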


Paper Structure

This paper contains 32 sections, 20 equations, 12 figures, and 1 table.

Figures (12)

  • Figure 1: We aim to generate a 3D object jointly from a text prompt and a style reference image so that the object integrates the descriptive elements in the text prompt and the aesthetic style in the reference image. Our method employs a stylized score distillation to steer a text-to-3D optimization using combined scores from a pretrained text-to-image model and its modified variant on attention layer features, generating stylized 3D objects in a single-stage optimization.
  • Figure 2: Qualitative comparisons to baseline methods. Our stylized score distillation leads to 3D generation results consistently aligned with the text prompts and the style reference images. Multi-view renderings are shown in the supplementary video.
  • Figure 3: Our method applied to different score distillation losses. Multi-view renderings are shown in the supplementary video.
  • Figure 4: Our results on various style images from StyleAligned (Hertz et al., 2023).
  • Figure 5: Our results on complex and detailed text prompts.
  • ...and 7 more figures