Interactive Scene Authoring with Specialized Generative Primitives

Clément Jambon, Changwoon Choi, Dongsu Zhang, Olga Sorkine-Hornung, Young Min Kim

TL;DR

This work introduces Specialized Generative Primitives, which let non-experts author complex 3D scenes from casual video. The video is first converted into a high-fidelity 3D appearance model made of $N$ 3D Gaussians with explicit parameters. Generation is decoupled from appearance: a Generative Cellular Automata (GCA) model, trained on a single exemplar in under 10 minutes, operates on semantically enriched sparse voxels, and a sparse patch consistency refinement then maps the generated voxels to explicit Gaussians. An interactive editor supports multiple conditioning modalities (exemplar brushes, meshes, voxel edits) and real-time composition, enabling appearance transfer, geometry edits, and multi-primitive scene authoring with generation times of $0.5$–$2$ seconds per primitive. The approach targets edge-device deployment and interactive creativity. The authors acknowledge limitations that stem from single-exemplar training and suggest future work on bridging with large priors and on automatic resolution selection. A dataset of Specialized Generative Primitives demonstrates applicability from object-level to large-scale real-world scenes and supports iterative, compositional workflows for 3D content creation.

Abstract

Generating high-quality 3D digital assets often requires expert knowledge of complex design tools. We introduce Specialized Generative Primitives, a generative framework that allows non-expert users to author high-quality 3D scenes in a seamless, lightweight, and controllable manner. Each primitive is an efficient generative model that captures the distribution of a single exemplar from the real world. With our framework, users capture a video of an environment, which we turn into a high-quality and explicit appearance model thanks to 3D Gaussian Splatting. Users then select regions of interest guided by semantically-aware features. To create a generative primitive, we adapt Generative Cellular Automata to single-exemplar training and controllable generation. We decouple the generative task from the appearance model by operating on sparse voxels and we recover a high-quality output with a subsequent sparse patch consistency step. Each primitive can be trained within 10 minutes and used to author new scenes interactively in a fully compositional manner. We showcase interactive sessions where various primitives are extracted from real-world scenes and controlled to create 3D assets and scenes in a few minutes. We also demonstrate additional capabilities of our primitives: handling various 3D representations to control generation, transferring appearances, and editing geometries.

Paper Structure

This paper contains 46 sections, 12 equations, 22 figures, 1 table.

Figures (22)

  • Figure 1: Overview of the user workflow (first row) and the underlying technical components (second row). Preparation phase. A user captures a 3D scene that is reconstructed as 3D Gaussians augmented with DINO features. With the guidance of these features, the user can extract the region of interest as well as optional "exemplar brushes". A hierarchy of sparse voxels is built from the former and used to train a "specialized primitive" using GCA at a target resolution conditioned on a given coarse resolution. Authoring phase. The user can author content with this primitive through multiple modalities ("exemplar brush", mesh, or direct voxel editing) that are converted into coarse voxels. The output of each primitive can be freely composited with other primitives or static 3D Gaussian regions. Under the hood, each primitive samples a featurized grid of voxels at the target resolution from the coarse conditioning voxels through GCA. These voxels are then remapped to the set of 3D Gaussians in the exemplar with our sparse patch consistency step.
  • Figure 2: Illustration of our user interface. (a) Our interactive viewer runs at real-time framerates (30-60fps) and comes with a selection tool using the quantized DINO features. Additional adjustments can be made at any stage with a manual selection tool. (b) Conditioning can be performed using "exemplar brushes," voxelized 3D meshes, or direct voxel editing. (c) From this coarse conditioning signal, diverse assets can be generated using GCA and a subsequent patch consistency step. (d) Multiple generated assets can be composited into a single scene within our editor.
  • Figure 3: From a selected scene, "exemplar brushes" representing regions of the scene, such as the waterfall or rocks, can be extracted within our editor. Authoring can then be performed coarsely by resampling these primitives and/or directly editing the voxels.
  • Figure 4: Starting from an arbitrary set of anisotropic 3D Gaussians (left), we build a hierarchy of voxels, each bearing a unique feature. The finest level is progressively downsampled to produce coarser levels. Ultimately, only two levels will be used: a coarse conditioning resolution and a finer target resolution for generating sparse voxels. Features are only retained for the target resolution.
  • Figure 5: Averaging features of Gaussians within a given voxel tends to "wash out" statistics (center). We thus prefer to pick the feature of a single representative Gaussian (right), which we choose to be the highest opacity one. In this figure, we show this at a coarse resolution and on RGB colors directly for visualization purposes.
  • ...and 17 more figures
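
The voxelization described in the captions of Figures 4 and 5 can be illustrated with a small sketch. It is not the authors' implementation: the function names, the assignment of each Gaussian to a voxel by its center, the downsampling factor of 2, and the rule that a coarse voxel is occupied when any of its children is occupied are all assumptions made for illustration. The sketch compares the highest-opacity representative (Figure 5, right) with plain averaging (Figure 5, center), and builds a coarser occupancy-only level, since features are only retained at the target resolution.

```python
from math import floor


def voxel_key(position, voxel_size):
    """Integer voxel index containing a 3D point (hypothetical helper)."""
    return tuple(floor(c / voxel_size) for c in position)


def voxelize_max_opacity(positions, opacities, features, voxel_size):
    """Give each occupied voxel the feature of its highest-opacity Gaussian."""
    best = {}  # voxel index -> (opacity, feature)
    for p, a, f in zip(positions, opacities, features):
        key = voxel_key(p, voxel_size)
        if key not in best or a > best[key][0]:
            best[key] = (a, f)
    return {key: f for key, (_, f) in best.items()}


def voxelize_mean(positions, features, voxel_size):
    """Baseline for comparison: average the features falling in each voxel."""
    sums, counts = {}, {}
    for p, f in zip(positions, features):
        key = voxel_key(p, voxel_size)
        acc = sums.setdefault(key, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[key] = counts.get(key, 0) + 1
    return {key: tuple(v / counts[key] for v in acc) for key, acc in sums.items()}


def downsample_occupancy(voxels, factor=2):
    """Coarser level (occupancy only): occupied if any child voxel is occupied."""
    return {tuple(c // factor for c in key) for key in voxels}


# Toy data: two Gaussians share a voxel, one red and opaque, one blue and faint.
positions = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (1.6, 0.2, 0.2)]
opacities = [0.9, 0.2, 0.5]
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]

fine = voxelize_max_opacity(positions, opacities, colors, voxel_size=0.5)
averaged = voxelize_mean(positions, colors, voxel_size=0.5)
coarse = downsample_occupancy(fine)

print("max-opacity:", fine)
print("averaged:   ", averaged)
print("coarse:     ", sorted(coarse))
```

In the toy data, the shared voxel keeps the pure red of the more opaque Gaussian under the max-opacity rule, whereas averaging blends red and blue into a color that neither Gaussian has. This is the "wash out" effect that the Figure 5 caption describes.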