Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling
Junhong Lee, Seungwook Kim, Minsu Cho
TL;DR
This work examines how training-free diffusion guidance techniques, notably Classifier-Free Guidance (CFG) and FreeU, affect Score Distillation Sampling (SDS) when it is used for text-to-3D generation via 2D lifting. It introduces a dynamic scaling scheme that schedules FreeU over diffusion timesteps and CFG over optimization iterations, balancing texture detail, surface smoothness, and geometric stability in the 3D outputs. The approach is evaluated on SDS-based pipelines (including MVDream, DreamFusion, and Magic3D) and in a user study, showing improved perceived quality and CLIP-consistent results, and it generalizes to other SDS methods. The findings offer practical guidance for designing training-free interventions in diffusion-based 3D generation and a path toward high-fidelity multi-view 3D content with manageable artifacts.
Abstract
Recent studies show that simple training-free techniques, such as Classifier-Free Guidance (CFG) or FreeU, can dramatically improve the quality of text-to-2D generation outputs. However, these techniques remain underexplored through the lens of Score Distillation Sampling (SDS), a popular and effective way to leverage pretrained text-to-2D diffusion models for various tasks. In this paper, we shed light on the effect such training-free techniques have on SDS, focusing on one application: text-to-3D generation via 2D lifting. Our findings show that varying the CFG scale presents a trade-off between object size and surface smoothness, while varying the FreeU scale presents a trade-off between texture details and geometric errors. Based on these findings, we provide insights into how to harness training-free techniques for SDS effectively, by scaling them dynamically with respect to the timestep or the optimization iteration. We show that our proposed scheme strikes a favorable balance between texture details and surface smoothness in text-to-3D generation, while preserving the size of the output and mitigating geometric defects.
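To make the idea of dynamic scaling concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear schedule for both scales; the function names, the direction of each schedule, and all numeric values are hypothetical placeholders, since the abstract does not specify them. The CFG combination and the SDS residual follow the standard formulations, and the FreeU function only produces a backbone scaling factor (FreeU itself, which rescales U-Net backbone and skip features, is not implemented here).

```python
import numpy as np


def linear_schedule(progress, start, end):
    """Linearly interpolate from `start` to `end` as progress goes 0 -> 1."""
    p = float(np.clip(progress, 0.0, 1.0))
    return start + (end - start) * p


def cfg_scale_at(iteration, total_iterations, start=100.0, end=50.0):
    """Hypothetical CFG scale as a function of the optimization iteration.

    The start/end values and the decreasing direction are assumptions.
    """
    progress = iteration / max(total_iterations - 1, 1)
    return linear_schedule(progress, start, end)


def freeu_backbone_scale_at(timestep, num_train_timesteps=1000,
                            low_noise=1.0, high_noise=1.2):
    """Hypothetical FreeU backbone factor as a function of the diffusion timestep.

    Only the scaling factor is scheduled; the values are placeholders.
    """
    progress = timestep / max(num_train_timesteps - 1, 1)
    return linear_schedule(progress, low_noise, high_noise)


def guided_noise(eps_uncond, eps_cond, cfg_scale):
    """Standard classifier-free guidance combination of two noise predictions."""
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)


def sds_residual(eps_uncond, eps_cond, eps_true, cfg_scale, weight=1.0):
    """SDS residual w(t) * (eps_hat - eps); the render Jacobian is omitted."""
    return weight * (guided_noise(eps_uncond, eps_cond, cfg_scale) - eps_true)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (4, 8, 8)  # stand-in for a latent of a rendered view
    eps_true = rng.standard_normal(shape)
    eps_uncond = rng.standard_normal(shape)  # stand-in for model outputs
    eps_cond = rng.standard_normal(shape)

    total_iterations = 5
    for it in range(total_iterations):
        t = int(rng.integers(20, 980))  # randomly sampled diffusion timestep
        s_cfg = cfg_scale_at(it, total_iterations)
        b = freeu_backbone_scale_at(t)
        grad = sds_residual(eps_uncond, eps_cond, eps_true, s_cfg)
        print(f"iter={it} t={t} cfg={s_cfg:.1f} freeu_b={b:.3f} "
              f"|grad|={np.linalg.norm(grad):.2f}")
```

In a real pipeline the two noise predictions would come from the pretrained diffusion model with the scheduled FreeU factor applied inside its U-Net, and the residual would be backpropagated through the differentiable renderer to the 3D representation.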
