PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes

Christina Ourania Tze; Daniel Dauner; Yiyi Liao; Dzmitry Tsishkou; Andreas Geiger

TL;DR

PrITTI introduces a primitive-based framework for large-scale, controllable 3D semantic urban scene generation, combining ground raster maps with vectorized object primitives in a two-stage LVAE+latent-diffusion pipeline. The LVAE encodes a disentangled ground-and-objects latent representation, which a diffusion transformer then expands into diverse, high-quality layouts; RePaint-inspired latent manipulation enables editing tasks without retraining. Experiments on KITTI-360 show competitive reconstruction with substantially lower memory than voxel baselines and superior generation quality and editability, including scene inpainting, outpainting, extrapolation, and photo-realistic street-view synthesis. The approach yields intuitive object-level edits and scalable scene generation, albeit with limitations in geometric fidelity and open-vocabulary extension, pointing to future work in expressive primitives, appearance rendering, and dynamic environments.

Abstract

Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis. Code and models are publicly available at https://raniatze.github.io/pritti/.
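
To make the memory argument concrete, the following back-of-envelope sketch compares the storage of a dense semantic voxel grid with that of a hybrid scene made of vectorized objects plus a ground raster. It is an illustration only: the grid size, object count, per-object parameter count, and raster shape are hypothetical values chosen for the example, not figures from the paper, and the function names are not part of the PrITTI code.

```python
def dense_voxel_bytes(dims, bytes_per_voxel=1):
    """Storage for a dense semantic grid: one class label per voxel."""
    x, y, z = dims
    return x * y * z * bytes_per_voxel


def hybrid_bytes(num_objects, floats_per_object, raster_hw, raster_channels,
                 bytes_per_float=4):
    """Storage for vectorized object primitives plus a rasterized ground map."""
    height, width = raster_hw
    object_bytes = num_objects * floats_per_object * bytes_per_float
    ground_bytes = height * width * raster_channels * bytes_per_float
    return object_bytes + ground_bytes


# Hypothetical scene sizes (assumptions, not taken from the paper).
voxel = dense_voxel_bytes((256, 256, 32))
hybrid = hybrid_bytes(
    num_objects=100,       # assumed number of objects in the scene
    floats_per_object=8,   # assumed: center (3), size (3), yaw (1), class (1)
    raster_hw=(64, 64),    # assumed ground raster resolution
    raster_channels=4,     # assumed channels, e.g. height plus ground classes
)

print(f"dense voxels: {voxel} bytes")
print(f"hybrid scene: {hybrid} bytes")
print(f"ratio: {voxel / hybrid:.1f}x")
```

Under these assumed sizes the object term is negligible and the ground raster dominates the hybrid cost, which also shows why the primitive count, unlike the voxel resolution, barely affects the footprint.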
