Octree Diffusion for Semantic Scene Generation and Completion

Xujia Zhang, Brendan Crowe, Christoffer Heckman

Abstract

The completion, extension, and generation of 3D semantic scenes are an interrelated set of capabilities that are useful for robotic navigation and exploration. Existing approaches decouple these problems and solve each one in isolation. Additionally, these approaches are often domain-specific, requiring separate models for different data distributions, e.g., indoor vs. outdoor scenes. To unify these techniques and provide cross-domain compatibility, we develop a single framework that can perform scene completion, extension, and generation in both indoor and outdoor scenes, which we term Octree Latent Semantic Diffusion. Our approach operates directly on an efficient dual octree graph latent representation: a hierarchical, sparse, and memory-efficient occupancy structure. This technique disentangles synthesis into two stages: (i) structure diffusion, which predicts binary split signals to construct a coarse occupancy octree, and (ii) latent semantic diffusion, which generates semantic embeddings decoded by a graph VAE into voxel-level semantic labels. To perform semantic scene completion or extension, our model leverages inference-time latent inpainting or outpainting, respectively. These inference-time methods use partial LiDAR scans or maps to condition generation, without the need for retraining or finetuning. We demonstrate high-quality structure, coherent semantics, and robust completion from single LiDAR scans, as well as zero-shot generalization to out-of-distribution LiDAR data. These results indicate that completion-through-generation in a dual octree graph latent space is a practical and scalable alternative to regression-based pipelines for real-world robotic perception tasks.
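As a rough illustration of how binary split signals can define a coarse occupancy octree, the following minimal sketch expands a root node level by level. It is not the authors' implementation: the function name `expand_octree`, the dictionary layout of `split_signals`, and the hard-coded signals are all assumptions made for this example, whereas in the paper the signals are predicted by the structure diffusion stage.

```python
from itertools import product


def expand_octree(split_signals, max_depth):
    """Build per-depth lists of node coordinates from binary split signals.

    split_signals[d] maps an integer (x, y, z) node coordinate at depth d
    to True (subdivide into 8 children) or False (leaf). Missing entries
    are treated as leaves. This layout is a hypothetical choice.
    """
    levels = [[(0, 0, 0)]]  # depth 0 holds only the root node
    for depth in range(max_depth):
        signals_at_depth = split_signals.get(depth, {})
        children = []
        for (x, y, z) in levels[depth]:
            if signals_at_depth.get((x, y, z), False):
                # A split node produces 8 children at twice the resolution.
                for dx, dy, dz in product((0, 1), repeat=3):
                    children.append((2 * x + dx, 2 * y + dy, 2 * z + dz))
        levels.append(children)
    return levels


# Hard-coded stand-in for predicted split signals (illustrative only).
signals = {
    0: {(0, 0, 0): True},
    1: {(0, 0, 0): True, (1, 1, 1): True},
}
levels = expand_octree(signals, max_depth=2)
node_counts = [len(level) for level in levels]
print(node_counts)
```

Only nodes flagged for splitting are refined, so the node count grows with the occupied structure rather than with the full dense grid, which is the sparsity property the abstract refers to.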

Paper Structure

This paper contains 18 sections, 2 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Our method unifies unconditional generation and LiDAR-conditioned completion within the same framework. When available, a LiDAR scan can be voxelized and used to initialize the latent structure; partial semantic voxels, if provided, are encoded and used to anchor node latents. Both structure and semantics are then introduced during postconditioned diffusion sampling, which preserves observed regions while freely synthesizing unobserved areas. When no LiDAR or semantic input is provided, the masks default to all zeros, causing the model to perform unconditional generation and sample entire scenes from pure noise.
  • Figure 2: Left: A 2D rendering of an octree. Right: Corresponding dual octree graph.
  • Figure 3: The Patch-VAE pipeline. A semantic voxel representation of the scene is given as input. The patch encoder spatially compresses every non-empty patch in the semantic voxel map and forms an octree in the compressed space, which is then converted into a dual octree graph. The VAE encoder outputs a latent representation on a graph of shallower depth. The VAE decoder uses a shared MLP head to predict the split signal for each node at the finest depth and reconstructs the latent graph. Finally, the patch decoder converts the latent into a semantic voxel map.
  • Figure 4: Given a prespecified patch size, a shared patch encoder operates on every patch of the semantic scene and outputs a latent vector for each patch.
  • Figure 5: Scene generation results via our two-stage pipeline. Top: Two example semantic scene generations. Bottom: Example semantic scene extension. Left of the dotted line is an input semantic scene; to the right is an extension of the scene via outpainting.
  • ...and 4 more figures
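
The mask-based conditioning described in the Figure 1 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's sampler: `toy_denoiser` is a hypothetical stand-in for a trained diffusion model, latents are plain lists of floats, and observed values are re-imposed directly after each step, whereas practical inpainting samplers typically noise the observed latents to the current noise level before blending.

```python
import random


def blend(generated, observed, mask):
    """Keep observed latents where mask is 1 and generated ones elsewhere."""
    return [o if m else g for g, o, m in zip(generated, observed, mask)]


def toy_denoiser(latents):
    """Hypothetical placeholder for one learned denoising step."""
    return [0.5 * v for v in latents]


def sample(observed, mask, steps=10, seed=0):
    """Run a toy sampling loop that re-imposes observed latents each step."""
    rng = random.Random(seed)
    latents = [rng.gauss(0.0, 1.0) for _ in observed]  # start from noise
    for _ in range(steps):
        latents = toy_denoiser(latents)
        latents = blend(latents, observed, mask)
    return latents


observed = [1.0, 2.0, 0.0, 0.0]
completed = sample(observed, mask=[1, 1, 0, 0])

# With an all-zero mask the observed values are never used, so the
# result depends only on the initial noise (unconditional generation).
uncond_a = sample([9.0, 9.0, 9.0, 9.0], mask=[0, 0, 0, 0])
uncond_b = sample([0.0, 0.0, 0.0, 0.0], mask=[0, 0, 0, 0])
```

In this sketch, masked positions of `completed` match the observed values exactly, while an all-zero mask makes the output independent of `observed`, mirroring the caption's description of unconditional generation as the default case.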