PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes

Christina Ourania Tze, Daniel Dauner, Yiyi Liao, Dzmitry Tsishkou, Andreas Geiger

TL;DR

PrITTI introduces a primitive-based framework for large-scale, controllable 3D semantic urban scene generation, combining rasterized ground maps with vectorized object primitives in a two-stage pipeline: a layout VAE (LVAE) followed by latent diffusion. The LVAE encodes a disentangled ground-and-objects latent representation, over which a diffusion transformer is trained to generate diverse, high-quality layouts; RePaint-inspired latent manipulation enables editing tasks without retraining. Experiments on KITTI-360 show competitive reconstruction with substantially lower memory than voxel baselines, along with superior generation quality and editability. Supported applications include scene inpainting, outpainting, extrapolation, and photo-realistic street-view synthesis. The approach yields intuitive object-level edits and scalable scene generation, albeit with limitations in geometric fidelity and open-vocabulary extension, pointing to future work on more expressive primitives, appearance rendering, and dynamic environments.
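The RePaint-inspired latent manipulation mentioned above can be illustrated with a toy sketch. It shows only the general RePaint idea: at every reverse step, entries marked as known are re-noised from the clean latent, while the remaining entries come from the denoiser. The 1D latent, the `alpha_bars` schedule, and `toy_denoise_step` are hypothetical placeholders and are not taken from the PrITTI code.

```python
import math
import random


def q_sample(z0, alpha_bar, rng):
    """Forward-noise a clean latent z0 to the noise level given by alpha_bar."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in z0]


def blend(z_known, z_generated, mask):
    """Take z_known where mask == 1 and z_generated where mask == 0."""
    return [m * k + (1 - m) * g for k, g, m in zip(z_known, z_generated, mask)]


def toy_denoise_step(z_t):
    """Hypothetical stand-in for one learned reverse diffusion step."""
    return [0.9 * v for v in z_t]


def repaint_edit(z0_known, mask, alpha_bars, seed=0):
    """RePaint-style loop; alpha_bars runs from noisiest to cleanest (ends at 1.0)."""
    rng = random.Random(seed)
    z_t = [rng.gauss(0.0, 1.0) for _ in z0_known]  # start from pure noise
    for alpha_bar in alpha_bars:
        z_generated = toy_denoise_step(z_t)           # unknown region: model output
        z_known = q_sample(z0_known, alpha_bar, rng)  # known region: re-noised input
        z_t = blend(z_known, z_generated, mask)
    # Paste the clean known entries back so they are preserved exactly.
    return blend(z0_known, z_t, mask)


if __name__ == "__main__":
    latent = [1.0, -2.0, 0.5, 3.0]
    keep = [1, 1, 0, 0]  # 1 = keep this entry, 0 = regenerate it
    print(repaint_edit(latent, keep, [0.1, 0.5, 0.9, 1.0]))
```

Because only the mask changes between inpainting (regenerate an interior region) and outpainting (regenerate beyond the known region), this kind of blending needs no additional training of the diffusion model.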

Abstract

Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis. Code and models are publicly available at https://raniatze.github.io/pritti/.

Paper Structure

This paper contains 30 sections, 20 equations, 27 figures, 7 tables.

Figures (27)

  • Figure 1: PrITTI generates (1) high-quality, controllable 3D semantic urban scenes in a compact primitive-based representation using a latent diffusion model. Starting from a generated scene (e.g. middle sample), we demonstrate downstream applications including (2) scene editing, (3) inpainting, (4) outpainting, and (5) photo-realistic street view synthesis.
  • Figure 2: Training Overview. An input 3D semantic layout $\mathcal{S}$ comprises object primitives, encoded as feature vectors $\mathbf{F}$, and extruded ground polygons, rasterized into height maps $\mathbf{H}$ and binary occupancy masks $\mathbf{B}$. A layout VAE with separate encoder-decoder pairs for objects ($\mathcal{E}_\mathcal{O}$/$\mathcal{D}_\mathcal{O}$) and ground ($\mathcal{E}_\mathcal{G}$/$\mathcal{D}_\mathcal{G}$) first compresses $\mathcal{S}$ into a structured latent representation $\mathbf{z}_\mathcal{L}$. In the second stage, a diffusion model is trained over this latent space for controllable scene generation. At inference, the diffusion model generates latent codes either unconditionally or conditioned on the scene label $y$, which are then decoded by the VAE into novel 3D layouts.
  • Figure 3: Stage 1: Qualitative reconstruction results on the same test scenes shown in each method’s native representation: primitives for PrITTI and voxel grids for SemCity and XCube. Voxel-based methods sometimes yield incomplete geometry and grid-induced distortions, such as vertical clipping at tall primitives.
  • Figure 4: Cholesky vs. quaternion encodings across training sizes.
  • Figure 5: 3D Semantic Scene Generation. Comparison of 3D semantic layouts generated by PrITTI (Ours, left) and voxel-based baselines (SemCity, PDD, XCube, right). PrITTI enables controllable generation, here conditioned on vegetation density (low, medium, or high), and produces more realistic, well-shaped scenes with clearer object boundaries. All baseline samples are generated unconditionally.
  • ...and 22 more figures
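
The Figure 2 caption describes ground polygons being rasterized into height maps $\mathbf{H}$ and binary occupancy masks $\mathbf{B}$. The sketch below shows one simple way such a rasterization could work, assuming a single polygon with a constant height, a square grid, and sampling at cell centers. The function names, the grid layout, and the even-odd inside test are illustrative assumptions, not the paper's implementation.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test for a point against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize_ground(polygon, height, grid_size, cell):
    """Return (H, B): a height map and a binary occupancy mask on a square grid.

    Assumption: each cell is sampled at its center, and the polygon has one
    constant height. Empty cells get height 0.0 and occupancy 0.
    """
    H = [[0.0] * grid_size for _ in range(grid_size)]
    B = [[0] * grid_size for _ in range(grid_size)]
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) * cell
            cy = (row + 0.5) * cell
            if point_in_polygon(cx, cy, polygon):
                B[row][col] = 1
                H[row][col] = height
    return H, B


if __name__ == "__main__":
    square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
    H, B = rasterize_ground(square, height=0.3, grid_size=4, cell=1.0)
    for mask_row in B:
        print(mask_row)
```

Keeping the occupancy mask separate from the height map lets a decoder distinguish "no ground here" from "ground at height zero", which a height map alone cannot express.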