External Knowledge Enhanced 3D Scene Generation from Sketch
Zijie Wu, Mingtao Feng, Yaonan Wang, He Xie, Weisheng Dong, Bo Miao, Ajmal Mian
TL;DR
This work addresses the challenge of generating realistic and diverse 3D scenes from sparse hand-drawn sketches. It introduces SEK, a diffusion-based framework conditioned on sketch cues and an external knowledge base of object relationships, augmented by a knowledge-enhanced graph reasoning module and a spectrum-filtered 3D denoiser. Key contributions include (i) constructing a relational knowledge base $KB=(\mathcal{V},\mathcal{R},p)$ from indoor scenes, (ii) a KeGR module that propagates knowledge through multi-step graph convolutions to produce a graph feature $H^G$, (iii) a ViT-based sketch encoder that, together with $H^G$, forms the conditioning $c=[H^S,H^G]$ for diffusion, and (iv) a spectrum-filtered 3D scene denoiser that suppresses padding and enhances object signals. Experiments on the 3D-FRONT dataset show state-of-the-art performance in 3D scene generation and completion, with notable cross-dataset transferability to ScanNet, validating the practical impact of combining sketch guidance with external knowledge for controllable, plausible scene synthesis.
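As a rough illustration of contributions (ii) and (iii), the sketch below propagates node features over a small relation graph with a standard GCN-style update and then concatenates the result with sketch tokens to form $c=[H^S,H^G]$. This is not the paper's KeGR module: the symmetric normalization, the ReLU, the feature sizes, the random stand-in for the ViT output, and the choice to concatenate along the token axis are all assumptions made for this example.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as in a standard GCN."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def propagate(adj, feats, weights):
    """Apply one graph convolution per weight matrix, each followed by ReLU."""
    a_norm = normalize_adjacency(adj)
    h = feats
    for w in weights:
        h = np.maximum(a_norm @ h @ w, 0.0)
    return h

rng = np.random.default_rng(0)
num_objects, dim = 4, 8

# Hypothetical relation strengths between object categories (assumption:
# a symmetric matrix of probabilities stands in for the p of the knowledge base).
adj = rng.random((num_objects, num_objects))
adj = (adj + adj.T) / 2

feats = rng.standard_normal((num_objects, dim))                      # node embeddings
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(2)]  # two propagation steps

h_g = propagate(adj, feats, weights)     # graph feature H^G, shape (4, 8)
h_s = rng.standard_normal((16, dim))     # random stand-in for ViT sketch tokens H^S
c = np.concatenate([h_s, h_g], axis=0)   # conditioning c = [H^S, H^G], shape (20, 8)
```

Concatenating along the token axis would let a transformer denoiser attend to sketch and knowledge tokens jointly; the source does not state which axis is actually used.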
Abstract
Generating realistic 3D scenes is challenging due to the complexity of room layouts and object geometries. We propose a sketch-based, knowledge-enhanced diffusion architecture (SEK) for generating customized, diverse, and plausible 3D scenes. SEK conditions the denoising process on a hand-drawn sketch of the target scene and on cues from an object relationship knowledge base. We first construct an external knowledge base containing object relationships and then leverage knowledge-enhanced graph reasoning to help the model interpret hand-drawn sketches. A scene is represented as a combination of 3D objects and their relationships, and is then incrementally diffused until it reaches a Gaussian distribution. We propose a 3D denoising scene transformer that learns to reverse the diffusion process, conditioned on a hand-drawn sketch along with knowledge cues, to regressively generate the scene, including the 3D object instances as well as their layout. Experiments on the 3D-FRONT dataset show that, compared to the nearest competitor DiffuScene, our model improves FID and CKL by 17.41% and 37.18%, respectively, in 3D scene generation, and FID and KID by 19.12% and 20.06%, respectively, in 3D scene completion.
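
The forward process described above, in which a scene is gradually noised until it is approximately Gaussian, can be sketched with the standard DDPM closed form $x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon$. The abstract does not give the noise schedule or the scene encoding, so the linear schedule, the number of steps, and the (12, 32) scene tensor below are assumptions chosen for illustration only.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) in closed form; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical scene tensor: 12 object slots, 32 attributes per object.
x0 = rng.standard_normal((12, 32))

# At the final step almost none of the original signal remains.
x_T, eps = q_sample(x0, len(alpha_bar) - 1, alpha_bar, rng)
```

A denoiser trained to predict `eps` from `x_t`, the step `t`, and the conditioning would then be applied step by step in reverse to generate a scene.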
