External Knowledge Enhanced 3D Scene Generation from Sketch

Zijie Wu, Mingtao Feng, Yaonan Wang, He Xie, Weisheng Dong, Bo Miao, Ajmal Mian

TL;DR

This work addresses the challenge of generating realistic and diverse 3D scenes from sparse hand-drawn sketches. It introduces SEK, a diffusion-based framework conditioned on sketch cues and an external knowledge base of object relationships, augmented by a knowledge-enhanced graph reasoning module and a spectrum-filtered 3D denoiser. Key contributions include (i) constructing a relational knowledge base $KB=(\mathcal{V},\mathcal{R},p)$ from indoor scenes, (ii) a KeGR module that propagates knowledge through multi-step graph convolutions to produce a graph feature $H^G$, (iii) a ViT-based sketch encoder that, together with $H^G$, forms the conditioning $c=[H^S,H^G]$ for diffusion, and (iv) a spectrum-filtered 3D scene denoiser that suppresses padding and enhances object signals. Experiments on the 3D-FRONT dataset show state-of-the-art performance in 3D scene generation and completion, with notable cross-dataset transferability to ScanNet, validating the practical impact of combining sketch guidance with external knowledge for controllable, plausible scene synthesis.
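To make the triple $KB=(\mathcal{V},\mathcal{R},p)$ concrete, the following is a minimal sketch of how such a knowledge base could be assembled from annotated scenes. It is not the authors' construction: the function name `build_knowledge_base`, the triple format, the toy relation labels, and in particular the estimate of $p$ as a relative frequency per object pair are all assumptions made for illustration.

```python
from collections import Counter


def build_knowledge_base(scenes):
    """Build a toy KB = (V, R, p) from annotated scenes.

    Each scene is a list of (subject, relation, object) triples.
    ASSUMPTION: p is estimated as the relative frequency of a relation
    among all triples that share the same (subject, object) pair.
    The paper may define p differently.
    """
    triple_counts = Counter()
    pair_counts = Counter()
    for scene in scenes:
        for subj, rel, obj in scene:
            triple_counts[(subj, rel, obj)] += 1
            pair_counts[(subj, obj)] += 1

    # V: all entity categories seen as subject or object.
    entities = sorted({e for (s, _, o) in triple_counts for e in (s, o)})
    # R: all relation labels seen in any triple.
    relations = sorted({r for (_, r, _) in triple_counts})
    # p: empirical probability of each relation given its entity pair.
    p = {t: c / pair_counts[(t[0], t[2])] for t, c in triple_counts.items()}
    return entities, relations, p


# Hypothetical toy annotations, not taken from 3D-FRONT.
scenes = [
    [("chair", "next_to", "table"), ("lamp", "on", "nightstand")],
    [("chair", "next_to", "table"), ("chair", "facing", "table")],
]
V, R, p = build_knowledge_base(scenes)
```

In this toy setting, a reasoning module could look up `p` for an entity pair to decide which relations are plausible for objects that the sketch does not depict explicitly.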

Abstract

Generating realistic 3D scenes is challenging due to the complexity of room layouts and object geometries. We propose a sketch-based, knowledge-enhanced diffusion architecture (SEK) for generating customized, diverse, and plausible 3D scenes. SEK conditions the denoising process on a hand-drawn sketch of the target scene and cues from an object relationship knowledge base. We first construct an external knowledge base containing object relationships and then leverage knowledge-enhanced graph reasoning to assist our model in understanding hand-drawn sketches. A scene is represented as a combination of 3D objects and their relationships, and then incrementally diffused to reach a Gaussian distribution. We propose a 3D denoising scene transformer that learns to reverse the diffusion process, conditioned on a hand-drawn sketch along with knowledge cues, to regressively generate the scene, including the 3D object instances as well as their layout. Experiments on the 3D-FRONT dataset show that, compared to the nearest competitor DiffuScene, our model improves FID and CKL by 17.41% and 37.18%, respectively, in 3D scene generation, and FID and KID by 19.12% and 20.06%, respectively, in 3D scene completion.
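The phrase "incrementally diffused to reach a Gaussian distribution" can be illustrated with the standard DDPM forward process, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$. The sketch below assumes that formulation with a linear noise schedule; the abstract does not state the schedule, the number of steps, or the scene encoding, so `linear_alpha_bar`, `q_sample`, the schedule constants, and the flattened `scene_vector` are hypothetical choices rather than the paper's settings.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule.

    ASSUMPTION: a DDPM-style linear schedule with common default values.
    """
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bar[t]
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for v in x0
    ]


rng = random.Random(0)
alpha_bar = linear_alpha_bar(1000)

# Hypothetical flattened scene attributes (e.g. position, size, shape code).
scene_vector = [0.5, -0.2, 1.0, 0.0]

x_early = q_sample(scene_vector, 0, alpha_bar, rng)    # almost the clean scene
x_late = q_sample(scene_vector, 999, alpha_bar, rng)   # close to pure noise
```

Because $\bar{\alpha}_t$ shrinks toward zero as $t$ grows, the signal term vanishes and the sample approaches a standard Gaussian, which is the starting point for the learned reverse (denoising) process.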

Paper Structure (12 sections, 13 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Our method generates a 3D scene from an input sketch and entities, enhanced by external knowledge. It follows explicit visual cues in the sketch for visible objects along with their relationships and employs plausibility reasoning to add objects that are not explicitly depicted (invisible) in the sketch, to generate a coherent scene.
  • Figure 2: Demonstration of the scene diffusion/denoising process of the matrix field and the spatial field. The denoising process samples from a Gaussian distribution and progressively denoises the sample for plausible and realistic scene generation. Note how the layout and 3D shapes are both simultaneously denoised.
  • Figure 3: Proposed SEK framework. (a) Sketch features are extracted by ViT and integrated with knowledge-enhanced reasoning features to form the denoising condition. The proposed 3D scene denoiser simultaneously generates plausible layouts and realistic 3D shapes in the matrix field. (b) The generated scene matrix is decoded to form the complete scene. (c) The denoising process: The scene denoiser starts from random noise and iteratively generates the scene matrix.
  • Figure 4: Qualitative comparison. Sync2Gen and ATISS perform retrieval using 3D bounding boxes. Graph-to-3D and our method perform generation, but we also show the corresponding retrieval results, obtained by searching for the nearest neighbor of the shape code, for comparison. Our method produces higher-quality generation with detailed shapes and more plausible relationships.
  • Figure 5: Demonstration of sketch & knowledge guided scene completion. In explicit completion, the sketch and user-specified entities complement each other. Beyond explicit instructions, the additional invisible entities are inferred based on knowledge and the current visible objects to generate plausible extra objects in the scene.