ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts

Uy Dieu Tran, Minh Luu, Phong Ha Nguyen, Khoi Nguyen, Binh-Son Hua

TL;DR

This paper tackles mode-seeking failures in SDS-based text-to-3D generation by introducing Image Prompt Score Distillation (ISD), which uses a reference image prompt via IP-Adapter to bias optimization toward a chosen mode in the diffusion prior. It couples ISD with a strong variance-reducing control variate and multi-view regularization (SDS-MVD) to improve stability, geometry, and texture, while accelerating optimization. The authors analyze SDS mode behavior, demonstrate improved mode control through image prompts, and validate on T3Bench and GPTEval3D, achieving competitive or state-of-the-art results with faster convergence. The approach yields high-quality, diverse 3D assets with better view-consistency, enabling more practical and scalable text-to-3D generation in real-world applications.

Abstract

Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, the 3D models they produce tend to be over-smoothed and of low quality. These issues arise from the mode-seeking behavior of current methods, in which the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented using IP-Adapter, a lightweight adapter that adds image prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not prompted by a reference image, can serve as an efficient control variate to reduce variance in score estimates, thereby enhancing both output quality and optimization stability. Our experiments show that the ISD loss consistently achieves visually coherent, high-quality outputs and improves optimization speed compared to prior text-to-3D methods, in both qualitative and quantitative evaluations on the T3Bench benchmark suite.
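
The role of a control variate in score distillation can be illustrated with a small 1D toy, sketched here under loud assumptions. It is not the ISD loss and does not use IP-Adapter: the "diffusion prior" is a single Gaussian mode with a closed-form noise prediction, and the control variate is the injected Gaussian noise itself (the baseline used by vanilla SDS). All function names and constants below are hypothetical. The sketch only shows the general principle that subtracting a zero-mean term correlated with the prediction keeps the expected update unchanged while shrinking its variance.

```python
import random
import statistics


def optimal_noise_prediction(x_t, alpha, sigma, mode_mean, mode_std):
    """Closed-form noise prediction for a 1D Gaussian prior N(mode_mean, mode_std^2).

    With x_t = alpha * x + sigma * eps, the marginal of x_t is Gaussian, so the
    ideal denoiser's noise estimate is -sigma times the marginal score.
    """
    marginal_var = (alpha * mode_std) ** 2 + sigma ** 2
    return sigma * (x_t - alpha * mode_mean) / marginal_var


def residual_samples(theta, use_control_variate, n=20000, seed=0,
                     alpha=0.8, sigma=0.6, mode_mean=1.0, mode_std=0.1):
    """Sample per-step update residuals for a scalar 'rendering' theta.

    Assumption: the control variate is the injected noise eps (zero mean),
    as in vanilla SDS. ISD instead uses an adapter-based prediction, which
    this toy does not model.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x_t = alpha * theta + sigma * eps  # forward-diffuse the rendering
        pred = optimal_noise_prediction(x_t, alpha, sigma, mode_mean, mode_std)
        samples.append(pred - eps if use_control_variate else pred)
    return samples


if __name__ == "__main__":
    plain = residual_samples(theta=0.0, use_control_variate=False)
    with_cv = residual_samples(theta=0.0, use_control_variate=True)
    print("mean without / with control variate:",
          statistics.fmean(plain), statistics.fmean(with_cv))
    print("variance without / with control variate:",
          statistics.pvariance(plain), statistics.pvariance(with_cv))
```

Both estimators point the rendering toward the mode with nearly the same average residual, but the one with the control variate has far lower variance in this setting. The reduction depends on how strongly the prediction correlates with the subtracted term, which is the design question the abstract raises when it proposes an adapter variant as a more efficient control variate.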

Paper Structure

This paper contains 15 sections, 9 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Our mode-guiding score distillation using the ISD loss explicitly selects a desired mode captured in a diffusion prior using a reference image prompt to steer text-to-3D generation. Our method leads to high-quality and diverse 3D generation. For each sub-figure, the top-left illustrates the reference image used for mode selection, and the remaining images illustrate different views of the generated object. The last row demonstrates the diversity of our 3D generation by using different reference images on the same text prompt.
  • Figure 2: Comparison between vanilla SDS and IP-SDS. IP-SDS generates detailed, sharp textures, whereas the original SDS does not.
  • Figure 3: A 2D toy experiment. Our ISD generates results similar to IP-VSD while having lower gradient variance. This motivates designing better control variates, removing the need to train LoRA in an alternating fashion as in VSD.
  • Figure 4: Different control variate settings, including Gaussian noise $\epsilon$, a learned LoRA-Unet as in VSD [wang2023prolificdreamer], a Unet at a different timestep as in ScaleDreamer [ma2024scaledreamer], and our ISD. Our method can generate 3D objects on par with VSD without learning an additional LoRA-Unet.
  • Figure 5: An overview of our method. Starting with an input prompt $y$, we generate a reference image $x_{\text{ref}}$ using a text-to-image model. Both the text prompt and the image prompt are used with the IP-Adapter for score distillation, following our ISD gradient $\nabla_\theta \mathcal{L}_{\text{ISD}}$. To mitigate the view bias introduced by the reference image, as well as the Janus problem, we incorporate additional multi-view regularization by jointly optimizing $\nabla_\theta \mathcal{L}_{\text{ISD}}$ with $\nabla_\theta \mathcal{L}_{\text{SDS-MVD}}$.
  • ...and 8 more figures