COMOGen: A Controllable Text-to-3D Multi-object Generation Framework
Shaorong Sun, Shuchao Pang, Yazhou Yao, Xiaoshui Huang
TL;DR
COMOGen tackles controllable text-to-3D generation of multiple objects with a three-module framework that leverages layout and multi-view priors. It introduces Layout-SDS, MV-SDS, and Layout Multi-view Score Distillation (LMSD) to fuse layout-driven placement with multi-view consistency, and refines 3D content via a COLA-based fine-tuning pipeline. The approach achieves better multi-object layout fidelity, diversity, and alignment than state-of-the-art baselines, as demonstrated through CLIP, T3Bench, and user studies, highlighting its potential for scalable, bounding-box-guided 3D content creation. The authors note evaluation limitations for multi-object scenarios and the need to extend 2D bounding-box descriptions to capture depth (z-axis) information in future work.
Abstract
The controllability of 3D object generation methods is achieved through input text. Existing text-to-3D object generation methods primarily focus on generating a single object from a single object description. However, these methods often struggle to place objects at the desired positions when the input text involves multiple objects. To address the issue of controllability in generating multiple objects, this paper introduces COMOGen, a COntrollable text-to-3D Multi-Object Generation framework. COMOGen enables the simultaneous generation of multiple 3D objects by distilling layout and multi-view prior knowledge. The framework consists of three modules: the layout control module, the multi-view consistency control module, and the 3D content enhancement module. Moreover, to integrate these three modules into a single framework, we propose Layout Multi-view Score Distillation, which unifies the two kinds of prior knowledge and further enhances the diversity and quality of the generated 3D content. Comprehensive experiments demonstrate the effectiveness of our approach compared with state-of-the-art methods, representing a significant step toward more controlled and versatile text-based 3D content generation.
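As a rough illustration of how two distilled priors might be unified, the sketch below combines two score-distillation gradients by a weighted sum. It is a minimal sketch and not the paper's formulation: the abstract does not give the LMSD equation, so the weighted-sum form, the function names `sds_grad` and `combined_grad`, and the weights `lam_layout` and `lam_mv` are all assumptions made for illustration. The only part taken from the general literature is the usual SDS-style gradient with respect to the rendered image, w(t) * (eps_pred - eps).

```python
import numpy as np


def sds_grad(eps_pred, eps, w_t):
    """SDS-style gradient w.r.t. the rendered image: w(t) * (eps_pred - eps)."""
    return w_t * (eps_pred - eps)


def combined_grad(eps_layout, eps_mv, eps, w_t, lam_layout=0.5, lam_mv=0.5):
    """Hypothetical fusion of two priors (assumption, not the paper's LMSD).

    eps_layout: noise predicted by a layout-conditioned diffusion model
    eps_mv:     noise predicted by a multi-view diffusion model
    eps:        the noise actually added to the rendered image
    """
    g_layout = sds_grad(eps_layout, eps, w_t)
    g_mv = sds_grad(eps_mv, eps, w_t)
    return lam_layout * g_layout + lam_mv * g_mv


# Toy usage with constant arrays standing in for model outputs.
eps = np.zeros((4, 4))
eps_layout = np.ones((4, 4))
eps_mv = np.full((4, 4), 3.0)
grad = combined_grad(eps_layout, eps_mv, eps, w_t=2.0)
```

In a real pipeline the two noise predictions would come from pretrained diffusion models, and the resulting gradient would be backpropagated through a differentiable renderer to the 3D representation; both steps are omitted here.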
