Dream-in-Style: Text-to-3D Generation Using Stylized Score Distillation
Hubert Kompanowski, Binh-Son Hua
TL;DR
This work tackles stylized text-to-3D generation by jointly optimizing geometry and appearance to match a text prompt and a style reference image. It introduces stylized score distillation, which blends scores from a pretrained text-to-image diffusion model and a modified variant with swapped attention features to inject style, enabling single-stage optimization of a neural radiance field. The approach supports multiple SDS-based loss families (SDS, NFSD, VSD) within a unified framework and shows strong qualitative and quantitative results, including a GPT-based user evaluation. Limitations include the Janus problem with single-view diffusion priors, mitigated by multi-view diffusion models in follow-up work, and difficulties with hard or ambiguous styling cases. The method is computationally efficient on current GPUs and offers a flexible, robust pathway for stylized 3D content creation from text and style references.
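As a rough illustration of the score blending, the sketch below (PyTorch, with hypothetical names such as stylized_sds_grad and gamma, and dummy denoisers standing in for the pretrained and style-injected UNets; this is not the authors' implementation) shows how two noise predictions could be mixed before an SDS-style gradient is pushed into the rendered latents.

```python
# Minimal sketch of stylized score distillation (hypothetical names, toy noise schedule).
# Two denoisers stand in for the pretrained text-to-image UNet and its attention-swapped,
# style-injected sibling; their noise predictions are blended before the SDS-style update.
import torch


class DummyDenoiser(torch.nn.Module):
    """Placeholder for a UNet noise predictor eps(x_t, t, text)."""

    def __init__(self, channels=4):
        super().__init__()
        self.net = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x_t, t):
        # A real UNet would also condition on the timestep and text embedding.
        return self.net(x_t)


def stylized_sds_grad(rendered_latents, eps_content, eps_style, gamma=0.5,
                      t=500, num_steps=1000, weight=1.0):
    """One SDS-style gradient with a blended (content/style) score.

    gamma controls the content-style trade-off: 0 keeps only the original
    model's score, 1 keeps only the style-injected model's score.
    """
    noise = torch.randn_like(rendered_latents)
    alpha_bar = 1.0 - t / num_steps                      # toy schedule for the sketch
    x_t = alpha_bar**0.5 * rendered_latents + (1 - alpha_bar)**0.5 * noise

    with torch.no_grad():
        eps_c = eps_content(x_t, t)                      # score from the pretrained model
        eps_s = eps_style(x_t, t)                        # score from the attention-swapped model
        eps_blend = (1 - gamma) * eps_c + gamma * eps_s

    # SDS gradient: weighted residual between blended prediction and the injected noise.
    return weight * (eps_blend - noise)


# Usage: apply the gradient directly to the NeRF-rendered latents.
latents = torch.randn(1, 4, 64, 64, requires_grad=True)
content_net, style_net = DummyDenoiser(), DummyDenoiser()
grad = stylized_sds_grad(latents, content_net, style_net, gamma=0.7)
latents.backward(gradient=grad)                          # accumulates the SDS-style gradient
```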
Abstract
We present a method to generate 3D objects in a given style. Our method takes a text prompt and a style reference image as input and optimizes a neural radiance field to synthesize a 3D model whose content aligns with the text prompt and whose style follows the reference image. To generate the 3D object and perform style transfer in a single optimization, we propose a stylized score distillation loss that guides the text-to-3D optimization toward visually plausible geometry and appearance. Our stylized score distillation combines an original pretrained text-to-image model with a modified sibling whose self-attention key and value features are manipulated to inject the style of the reference image. Comparisons with state-of-the-art methods demonstrate the strong visual performance of our method, which is further supported by quantitative results from our user study.
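The self-attention manipulation can be pictured as an attention block whose keys and values are swapped in from the style reference's features while the queries come from the generated content. The module below is a minimal, hypothetical PyTorch sketch of that idea under assumed shapes; names such as StyleSwappedSelfAttention are ours for illustration, not the paper's code.

```python
# Minimal sketch of the key/value swap used to build the style-injected model
# (hypothetical module; the real method patches self-attention inside a pretrained UNet).
import torch
import torch.nn.functional as F


class StyleSwappedSelfAttention(torch.nn.Module):
    """Self-attention whose keys and values come from style-reference features."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)
        self.to_out = torch.nn.Linear(dim, dim)

    def forward(self, content_tokens, style_tokens):
        # Queries from the generated (content) features; keys and values
        # swapped in from the style reference features.
        q = self.to_q(content_tokens)
        k = self.to_k(style_tokens)
        v = self.to_v(style_tokens)
        b, n, d = q.shape
        h = self.heads
        q, k, v = (x.reshape(b, -1, h, d // h).transpose(1, 2) for x in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)


# Usage: content tokens from the rendered view, style tokens from the reference image.
content = torch.randn(1, 256, 64)   # e.g. 16x16 latent patches, 64-dim features
style = torch.randn(1, 256, 64)
attn = StyleSwappedSelfAttention()
stylized = attn(content, style)     # content structure with style-conditioned features
```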
