Interactive Scene Authoring with Specialized Generative Primitives
Clément Jambon, Changwoon Choi, Dongsu Zhang, Olga Sorkine-Hornung, Young Min Kim
TL;DR
This work introduces Specialized Generative Primitives, which let non-experts author complex 3D scenes. A casual video is converted into a high-fidelity 3D appearance model made of $N$ 3D Gaussians with explicit parameters. Generation is decoupled from appearance: a single-exemplar Generative Cellular Automata (GCA) model, trained in under 10 minutes, operates on semantically enriched sparse voxels, and a sparse patch consistency refinement then maps the generated voxels back to explicit Gaussians. An interactive editor supports multiple conditioning modalities (exemplar brushes, meshes, voxel edits) and real-time composition, enabling appearance transfer, geometry edits, and multi-primitive scene authoring, with generation taking $0.5$–$2$ seconds per primitive. The approach targets edge-device deployment and interactive creativity. The authors acknowledge limitations that stem from single-exemplar training and suggest future work on bridging with large priors and on improving automatic resolution selection. A dataset of Specialized Generative Primitives demonstrates applicability from object-level to large-scale real-world scenes and supports iterative, compositional workflows for 3D content creation.
Abstract
Generating high-quality 3D digital assets often requires expert knowledge of complex design tools. We introduce Specialized Generative Primitives, a generative framework that allows non-expert users to author high-quality 3D scenes in a seamless, lightweight, and controllable manner. Each primitive is an efficient generative model that captures the distribution of a single exemplar from the real world. With our framework, users capture a video of an environment, which we turn into a high-quality and explicit appearance model thanks to 3D Gaussian Splatting. Users then select regions of interest guided by semantically aware features. To create a generative primitive, we adapt Generative Cellular Automata to single-exemplar training and controllable generation. We decouple the generative task from the appearance model by operating on sparse voxels, and we recover a high-quality output with a subsequent sparse patch consistency step. Each primitive can be trained within 10 minutes and used to author new scenes interactively in a fully compositional manner. We showcase interactive sessions where various primitives are extracted from real-world scenes and controlled to create 3D assets and scenes in a few minutes. We also demonstrate additional capabilities of our primitives: handling various 3D representations to control generation, transferring appearances, and editing geometries.
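
To make the idea of a cellular automaton operating on sparse voxels more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the state is a set of occupied integer voxel coordinates and that each step resamples occupancy only in a small neighborhood of the current state. The names `neighborhood`, `gca_step`, `sample`, and `plate_prob` are hypothetical. In particular, `plate_prob` is a hand-written stand-in for the learned transition model, chosen so the result is easy to check by hand.

```python
import random
from itertools import product
from typing import Callable, Set, Tuple

Voxel = Tuple[int, int, int]
ProbFn = Callable[[Voxel, Set[Voxel]], float]


def neighborhood(state: Set[Voxel], radius: int = 1) -> Set[Voxel]:
    """Cells within `radius` (Chebyshev distance) of any occupied voxel."""
    offsets = list(product(range(-radius, radius + 1), repeat=3))
    return {
        (x + dx, y + dy, z + dz)
        for (x, y, z) in state
        for (dx, dy, dz) in offsets
    }


def gca_step(state: Set[Voxel], prob_fn: ProbFn,
             rng: random.Random, radius: int = 1) -> Set[Voxel]:
    """One stochastic update: sample occupancy for every candidate cell.

    Only cells near the current state are visited, so the cost scales
    with the number of occupied voxels rather than the full grid volume.
    """
    candidates = sorted(neighborhood(state, radius))  # sorted for reproducibility
    return {c for c in candidates if rng.random() < prob_fn(c, state)}


def sample(seed_state: Set[Voxel], prob_fn: ProbFn,
           steps: int, seed: int = 0) -> Set[Voxel]:
    """Run `steps` updates starting from a user-provided seed state."""
    rng = random.Random(seed)
    state = set(seed_state)
    for _ in range(steps):
        state = gca_step(state, prob_fn, rng)
    return state


def plate_prob(cell: Voxel, state: Set[Voxel]) -> float:
    """Hypothetical stand-in for a learned model: a flat 5x5 plate at z = 0.

    A trained primitive would instead return intermediate probabilities
    predicted from the local voxel context in `state`.
    """
    x, y, z = cell
    return 1.0 if (abs(x) <= 2 and abs(y) <= 2 and z == 0) else 0.0


# Grow a shape from a single seed voxel.
final_state = sample({(0, 0, 0)}, plate_prob, steps=3)
```

With this deterministic stand-in, the seed voxel grows into the 25-cell plate within two steps and then stays fixed. Replacing `plate_prob` with a function that returns values strictly between 0 and 1 makes each run produce a different sample, which is the behavior a generative primitive relies on.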
