CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting

Alexander Vilesov, Pradyumna Chari, Achuta Kadambi

TL;DR

CG3D introduces an explicit compositional framework for text-to-3D generation using Gaussian radiance fields, enabling scalable multi-object scenes with physically realistic interactions. It combines structured scene graphs with Score Distillation Sampling guidance and a physics-based finetuning stage to ensure gravity and contact constraints are respected. The approach supports object-level editing, radiance-field distillation for memory efficiency, and extensive ablations, achieving strong performance against baselines in zero-shot compositional tasks. By decoupling objects and interactions and leveraging explicit geometry, CG3D enables flexible, editable, and physically plausible 3D scene creation from text prompts without retraining diffusion models.

Abstract

With the onset of diffusion-based generative models and their ability to generate text-conditioned images, content generation has received a massive invigoration. Recently, these models have been shown to provide useful guidance for the generation of 3D graphics assets. However, existing work in text-conditioned 3D generation faces fundamental constraints: (i) inability to generate detailed, multi-object scenes, (ii) inability to textually control multi-object configurations, and (iii) inability to produce physically realistic scene compositions. In this work, we propose CG3D, a method for compositionally generating scalable 3D assets that resolves these constraints. We find that explicit Gaussian radiance fields, parameterized to allow for compositions of objects, can enable semantically and physically consistent scenes. By utilizing a guidance framework built around this explicit representation, we show state-of-the-art results, capable of even exceeding the guiding diffusion model in terms of object combinations and physics accuracy.

Paper Structure (54 sections, 20 equations, 23 figures, 2 tables, 1 algorithm)

Figures (23)

  • Figure 1: We realize multi-object scenes through a Gaussian radiance field. Pseudocode to enable compositionality in Gaussian radiance fields incorporating rotation, translation, and scale to convert 3D Gaussians from object to composition coordinates.
  • Figure 2: Our method achieves compositional generation through ancestral sampling of a PGM of the scene. We first sample objects followed by their pairwise interactions.
  • Figure 3: Gradient descent optimization is poorly conditioned for estimating optimal $\mathbf{R}_{2,1}$, $s_{2,1}$ and $\mathbf{t}_{2,1}$. Here, we show an anomaly in the SDS loss for unnaturally small $s_{2,1}$. Similar anomalies exist in the estimation of $\mathbf{t}_{2,1}$.
  • Figure 4: Diffusion models, such as Stable Diffusion v2.1 (Rombach et al., 2022), do not always adhere to physical laws such as gravity, even for image generation. Additional physical guidance is required for realistic-looking scene compositions.
  • Figure 5: Explicit representations enable physically realistic scene composition. Consider spherical objects, made up of several Gaussians (represented by colored ellipses). (a) The gravity loss provides a gradient to move the object to the virtual floor without considerably penetrating the floor. (b) The contact loss prevents objects from unrealistically intersecting with each other, by minimizing the angle $\theta_c$ for intersecting points.
  • ...and 18 more figures
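
The Figure 1 caption mentions converting 3D Gaussians from object to composition coordinates using rotation, translation, and scale. The paper's own pseudocode is not reproduced on this page, so the following is only a minimal sketch of what such a conversion could look like. It assumes a similarity transform x' = s R x + t, under which a Gaussian's mean maps to s R μ + t and its covariance to s² R Σ Rᵀ. The function names (`to_composition`, `rotation_z`) and the covariance-matrix parameterization are illustrative assumptions, not the authors' implementation.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose a 3x3 matrix stored as nested lists."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rotation_z(theta):
    """Rotation by theta radians about the z axis (hypothetical helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def to_composition(mean, cov, R, s, t):
    """Map one Gaussian from object to composition coordinates.

    Assumes the similarity transform x' = s * R @ x + t, so that
    mean' = s * R @ mean + t and cov' = s**2 * R @ cov @ R^T.
    """
    new_mean = [s * sum(R[i][k] * mean[k] for k in range(3)) + t[i]
                for i in range(3)]
    rotated = matmul(matmul(R, cov), transpose(R))
    new_cov = [[s * s * rotated[i][j] for j in range(3)] for i in range(3)]
    return new_mean, new_cov

# Example: a Gaussian elongated along x, rotated 90 degrees about z,
# scaled by 2, and lifted by 1 along z.
mean = [1.0, 0.0, 0.0]
cov = [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R = rotation_z(math.pi / 2)
new_mean, new_cov = to_composition(mean, cov, R, 2.0, [0.0, 0.0, 1.0])
# new_mean is approximately [0, 2, 1]; new_cov is approximately diag(4, 16, 4)
```

In practice, Gaussian splatting implementations usually store a per-Gaussian rotation and per-axis scale rather than a full covariance matrix. Under the same assumption, the equivalent update would compose the object rotation with each Gaussian's rotation and multiply its scales by s.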