CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting
Alexander Vilesov, Pradyumna Chari, Achuta Kadambi
TL;DR
CG3D introduces an explicit compositional framework for text-to-3D generation using Gaussian radiance fields, enabling scalable multi-object scenes with physically realistic interactions. It combines structured scene graphs with Score Distillation Sampling guidance and a physics-based finetuning stage so that gravity and contact constraints are respected. The approach supports object-level editing and radiance-field distillation for memory efficiency. It is evaluated with extensive ablations and performs strongly against baselines on zero-shot compositional tasks. By decoupling objects from their interactions and leveraging explicit geometry, CG3D enables flexible, editable, and physically plausible 3D scene creation from text prompts without retraining diffusion models.
Abstract
The advent of diffusion-based generative models, with their ability to generate text-conditioned images, has reinvigorated content generation. Recently, these models have been shown to provide useful guidance for generating 3D graphics assets. However, existing work in text-conditioned 3D generation faces fundamental constraints: (i) an inability to generate detailed, multi-object scenes, (ii) an inability to textually control multi-object configurations, and (iii) an inability to compose scenes in a physically realistic way. In this work, we propose CG3D, a method for compositionally generating scalable 3D assets that resolves these constraints. We find that explicit Gaussian radiance fields, parameterized to allow for compositions of objects, enable semantically and physically consistent scenes. By utilizing a guidance framework built around this explicit representation, we show state-of-the-art results that can even exceed the guiding diffusion model in terms of object combinations and physics accuracy.
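To make the idea of composing explicit Gaussian radiance fields concrete, the following is a minimal sketch, not the authors' implementation. It assumes each object is a set of 3D Gaussians (means and covariances) in its own local frame, and that each object is placed in the scene by a similarity transform (uniform scale, rotation, translation). That parameterization and the names `rotation_z`, `place_object`, and `compose_scene` are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def place_object(means, covs, scale, rotation, translation):
    """Map an object's local Gaussians into the scene frame.

    Assumed transform (hypothetical, for illustration): x -> scale * R @ x + t.
    Under it, a Gaussian mean mu maps to scale * R @ mu + t and a covariance
    Sigma maps to scale**2 * R @ Sigma @ R.T.

    means: (N, 3) array, covs: (N, 3, 3) array.
    """
    world_means = scale * means @ rotation.T + translation
    world_covs = (scale ** 2) * rotation @ covs @ rotation.T
    return world_means, world_covs


def compose_scene(objects):
    """Concatenate the transformed Gaussians of all objects into one scene."""
    placed = [place_object(**obj) for obj in objects]
    means = np.concatenate([m for m, _ in placed])
    covs = np.concatenate([c for _, c in placed])
    return means, covs


# Two toy objects with one Gaussian each; all values are made up.
obj_a = dict(
    means=np.array([[1.0, 0.0, 0.0]]),
    covs=np.diag([1.0, 4.0, 9.0])[None],
    scale=2.0,
    rotation=rotation_z(np.pi / 2),
    translation=np.array([0.0, 0.0, 1.0]),
)
obj_b = dict(
    means=np.zeros((1, 3)),
    covs=np.eye(3)[None],
    scale=1.0,
    rotation=np.eye(3),
    translation=np.array([3.0, 0.0, 0.0]),
)

scene_means, scene_covs = compose_scene([obj_a, obj_b])
```

Because every object keeps its own Gaussians and its own transform, moving or rescaling one object in this sketch only changes that object's `scale`, `rotation`, and `translation`, which is the kind of object-level editing an explicit compositional representation makes straightforward.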
