ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts
Uy Dieu Tran, Minh Luu, Phong Ha Nguyen, Khoi Nguyen, Binh-Son Hua
TL;DR
This paper tackles mode-seeking failures in SDS-based text-to-3D generation by introducing Image Prompt Score Distillation (ISD), which uses a reference image prompt, injected via IP-Adapter, to bias optimization toward a chosen mode of the diffusion prior. It couples ISD with a variance-reducing control variate and multi-view regularization (SDS-MVD) to improve stability, geometry, and texture while accelerating optimization. The authors analyze the mode behavior of SDS, show that image prompts improve mode control, and validate on T3Bench and GPTEval3D, reporting competitive or state-of-the-art results with faster convergence. The approach yields high-quality, diverse 3D assets with better view consistency.
Abstract
Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, 3D models produced by SDS-based methods tend to be over-smoothed and of low quality. These issues arise from the mode-seeking behavior of current methods, where the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented by using IP-Adapter, a lightweight adapter that adds image-prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not prompted by a reference image, can serve as an efficient control variate to reduce variance in score estimates, thereby enhancing both output quality and optimization stability. Qualitative and quantitative evaluations on the T3Bench benchmark suite show that the ISD loss consistently produces visually coherent, high-quality outputs and optimizes faster than prior text-to-3D methods.
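
The contrast between the usual SDS update signal and an image-prompted, control-variate-based signal can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the authors' implementation: the function names (sds_residual, isd_style_residual, weighted_update) are hypothetical, the noise predictors are plain callables standing in for diffusion models, and the exact form of the ISD residual is assumed from the abstract's description (an image-prompted prediction minus an unprompted prediction used as the control variate).

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical predictor signature: (noisy_sample, timestep) -> predicted noise.
NoisePredictor = Callable[[Sequence[float], float], Vector]


def sds_residual(eps_text: NoisePredictor, x_t: Sequence[float],
                 t: float, eps: Sequence[float]) -> Vector:
    """Standard SDS residual: text-conditioned prediction minus the sampled noise."""
    return [p - e for p, e in zip(eps_text(x_t, t), eps)]


def isd_style_residual(eps_image_prompted: NoisePredictor,
                       eps_unprompted: NoisePredictor,
                       x_t: Sequence[float], t: float) -> Vector:
    """Assumed ISD-style residual: image-prompted prediction minus an
    unprompted prediction that acts as the control variate."""
    guided = eps_image_prompted(x_t, t)
    baseline = eps_unprompted(x_t, t)
    return [g - b for g, b in zip(guided, baseline)]


def weighted_update(residual: Sequence[float], weight: float) -> Vector:
    """Scale a residual by a timestep weight w(t), passed here as a plain float."""
    return [weight * r for r in residual]


if __name__ == "__main__":
    # Toy stand-ins for the two adapter variants (purely illustrative).
    guided = lambda x, t: [xi - 1.0 for xi in x]
    baseline = lambda x, t: [0.5 * xi for xi in x]
    residual = isd_style_residual(guided, baseline, [2.0, 4.0], 0.5)
    print(weighted_update(residual, 0.5))
```

In a real pipeline the weighted residual would be backpropagated through the differentiable renderer to the 3D parameters; that step is omitted here because it depends on the renderer and the 3D representation.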
