Semantic Score Distillation Sampling for Compositional Text-to-3D Generation
Ling Yang, Zixiang Zhang, Junlin Han, Bohan Zeng, Runjia Li, Philip Torr, Wentao Zhang
TL;DR
SemanticSDS tackles the challenge of compositional text-to-3D generation under limited 3D data by introducing explicit semantic guidance into Score Distillation Sampling. It integrates program-aided layout planning, expressive semantic embeddings, and a semantic-map-driven, region-wise SDS that operates on a 3D Gaussian Splatting representation to enable fine-grained control over multi-object scenes. Empirical results show state-of-the-art performance on complex scenes with improved prompt alignment, spatial arrangement, geometric fidelity, and scene quality, validated by both quantitative metrics (e.g., CLIP) and GPT-4V-based assessments. This work enhances the practicality of diffusion priors for 3D content creation and lays groundwork for future editing and closed-loop refinement of 3D assets.
Abstract
Generating high-quality 3D assets from textual descriptions remains a pivotal challenge in computer graphics and vision research. Due to the scarcity of 3D data, state-of-the-art approaches utilize pre-trained 2D diffusion priors, optimized through Score Distillation Sampling (SDS). Despite progress, crafting complex 3D scenes featuring multiple objects or intricate interactions is still difficult. To tackle this, recent methods have incorporated box or layout guidance. However, these layout-guided compositional methods often struggle to provide fine-grained control, because the guidance they rely on is generally coarse and lacks expressiveness. To overcome these challenges, we introduce a novel SDS approach, Semantic Score Distillation Sampling (SemanticSDS), designed to effectively improve the expressiveness and accuracy of compositional text-to-3D generation. Our approach integrates new semantic embeddings that maintain consistency across different rendering views and clearly differentiate between various objects and parts. These embeddings are transformed into a semantic map, which directs a region-specific SDS process, enabling precise optimization and compositional generation. By leveraging explicit semantic guidance, our method unlocks the compositional capabilities of existing pre-trained diffusion models, thereby achieving superior quality in 3D content generation, particularly for complex objects and scenes. Experimental results demonstrate that SemanticSDS achieves state-of-the-art quality when generating complex 3D content. Code: https://github.com/YangLing0818/SemanticSDS-3D
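The region-specific SDS step described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes the semantic map has already been reduced to an integer label per pixel, and that a denoiser (not shown) has produced one noise prediction per region prompt. The function names `semantic_map_to_masks` and `region_wise_sds_residual`, the array shapes, and the scalar `weight` are all hypothetical choices made for illustration.

```python
import numpy as np

def semantic_map_to_masks(semantic_map, num_regions):
    """Turn an integer label map (H, W) into one-hot masks (K, H, W).

    Assumption: each pixel belongs to exactly one region label in [0, K).
    """
    masks = [(semantic_map == k) for k in range(num_regions)]
    return np.stack(masks).astype(np.float32)

def region_wise_sds_residual(noise_preds, noise, masks, weight=1.0):
    """Blend per-region noise predictions and form an SDS-style residual.

    noise_preds: (K, C, H, W), hypothetical denoiser outputs, one per region prompt
    noise:       (C, H, W), the noise that was added to the rendered image
    masks:       (K, H, W), one-hot region masks from the semantic map
    Returns a (C, H, W) residual, weight * (blended prediction - noise).
    """
    # Broadcast each mask over the channel axis, then sum over regions so
    # every pixel takes the prediction conditioned on its own region prompt.
    blended = (masks[:, None] * noise_preds).sum(axis=0)
    return weight * (blended - noise)

# Toy usage: two regions on a 2x2 image with a single channel.
semantic_map = np.array([[0, 1],
                         [1, 0]])
masks = semantic_map_to_masks(semantic_map, num_regions=2)
noise_preds = np.stack([np.full((1, 2, 2), 1.0),   # stand-in prediction, region 0
                        np.full((1, 2, 2), 2.0)])  # stand-in prediction, region 1
noise = np.zeros((1, 2, 2))
residual = region_wise_sds_residual(noise_preds, noise, masks)
```

In a full pipeline this residual would be backpropagated through the renderer to the 3D representation. The sketch shows only how a semantic map can route each pixel to the prediction for its own region.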
