Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Yongzhi Xu, Yonhon Ng, Yifu Wang, Inkyu Sa, Yunfei Duan, Zhenhong Sun, Yang Li, Pan Ji, Hongdong Li

TL;DR

Sketch2Scene generates large-scale, playable 3D game scenes from casual sketches in three stages. A pre-trained 2D diffusion model first produces an isometric reference image; a Visual Scene Understanding module then extracts a basemap and foreground layout; finally, a procedural 3D generation pipeline places assets in a Unity scene. The method introduces a ControlNet trained with a Sketch-Aware Loss (SAL) for sketch-conditioned isometric image generation, and a step-unrolled denoising diffusion inpainting model that produces clean basemaps, so that 3D scenes can be built despite limited 3D training data. Key contributions are the three-module pipeline, a specialized isometric basemap inpainting framework, a learning-based scene-understanding module (heightmap, splatmap, object placement), and a practical end-to-end route to interactive 3D scenes compatible with game engines. The resulting scenes are high-quality, controllable, and playable, which suggests uses in rapid game-world prototyping and content creation; the authors acknowledge limitations in pipeline complexity and texture diversity.

Abstract

3D content generation is at the heart of many computer graphics applications, including video gaming, film-making, and virtual and augmented reality. This paper proposes a novel deep-learning-based approach for automatically generating interactive and playable 3D game scenes, all from the user's casual prompts such as a hand-drawn sketch. Sketch-based input offers a natural and convenient way to convey the user's design intention in the content creation process. To circumvent the data-deficient challenge in learning (i.e., the lack of large training data of 3D scenes), our method leverages a pre-trained 2D denoising diffusion model to generate a 2D image of the scene as the conceptual guidance. In this process, we adopt the isometric projection mode to factor out unknown camera poses while obtaining the scene layout. From the generated isometric image, we use a pre-trained image understanding method to segment the image into meaningful parts, such as off-ground objects, trees, and buildings, and extract the 2D scene layout. These segments and layouts are subsequently fed into a procedural content generation (PCG) engine, such as a 3D video game engine like Unity or Unreal, to create the 3D scene. The resulting 3D scene can be seamlessly integrated into a game development environment and is readily playable. Extensive tests demonstrate that our method can efficiently generate high-quality and interactive 3D game scenes with layouts that closely follow the user's intention.
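
The abstract's point about isometric projection factoring out camera pose can be illustrated with a small sketch. It assumes the textbook "true isometric" camera (45 degree yaw, tilt of atan(1/sqrt(2))) and a y-up coordinate system; the paper's actual camera settings are not stated here, and `project_isometric` is a hypothetical helper name.

```python
import math

# Textbook isometric angles (an assumption, not taken from the paper):
# yaw 45 degrees about the vertical axis, then tilt down by atan(1/sqrt(2)).
YAW = math.radians(45.0)
PITCH = math.atan(1.0 / math.sqrt(2.0))


def project_isometric(x, y, z):
    """Orthographically project a 3D point (y is up) to 2D image coordinates."""
    # Rotate the scene about the vertical (y) axis.
    xr = x * math.cos(YAW) - z * math.sin(YAW)
    zr = x * math.sin(YAW) + z * math.cos(YAW)
    # Tilt about the horizontal axis, then discard depth (orthographic view).
    u = xr
    v = y * math.cos(PITCH) - zr * math.sin(PITCH)
    return u, v


if __name__ == "__main__":
    # All three unit axes are foreshortened by the same factor.
    for axis in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
        u, v = project_isometric(*axis)
        print(axis, round(math.hypot(u, v), 4))
```

Because the view is fixed and orthographic, the mapping from scene to image is a single known linear map: there is no per-image focal length, position, or orientation to estimate, and under these assumptions raising a point only shifts it vertically in the image.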

Paper Structure (28 sections, 7 equations, 16 figures, 1 algorithm)

This paper contains 28 sections, 7 equations, 16 figures, 1 algorithm.

Figures (16)

  • Figure 1: Overview of the pipeline of the proposed method. The input user sketch and text prompt are fed into our pre-trained ControlNet that generates a 2D isometric reference image. Our Scene-Understanding module then extracts the foreground object masks. The masks are fed to a pre-trained inpainting model which generates the isometric empty basemap (i.e., the background terrain with no objects). The scene understanding module also computes the heightmap, texture splatmap and object instance pose. Finally, a procedural 3D scene generation module is employed to generate and render the 3D game scene.
  • Figure 2: The Sketch-Aware Loss (SAL) facilitates ControlNet's training with a single ground truth image associated with diverse sketches generated through random category filtering, thereby enhancing its performance on flexible sketches.
  • Figure 3: Object footprint estimation, showing an illustrative example of obtaining a building footprint and height. On the left: Black region is the instance mask of a building, red box shows the homography-warped 2D object bounding box, blue box shows the estimated object footprint. On the right: Blue filled box shows the inverse-homography-warped object footprint, which can also be used to estimate the object height.
  • Figure 4: Results showing the generated isometric reference images (column-2), along with the inpainted basemaps (column-3). Sketch color codes: blue=water, yellow=building, orange=bridge, gray=roads, and green=trees.
  • Figure 5: Basemap inpainting results of SDXL-Inpaint (middle) and ours (right) on the isometric images (left).
  • ...and 11 more figures
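
The Figure 2 caption mentions pairing one ground-truth image with diverse sketches obtained through random category filtering. A minimal sketch of that idea follows, assuming a sketch is stored as a grid of integer category IDs. The category names come from the Figure 4 legend; the IDs, the `filter_categories` helper, and the `keep_prob` parameter are hypothetical, and the paper's actual sampling scheme is not described in this summary.

```python
import random

# Category names follow the Figure 4 color legend; the integer IDs are
# hypothetical and chosen only for this illustration.
CATEGORIES = {1: "water", 2: "building", 3: "bridge", 4: "roads", 5: "trees"}
EMPTY = 0


def filter_categories(sketch, keep_prob=0.5, rng=None):
    """Return a copy of `sketch` with a random subset of categories erased.

    `sketch` is a list of rows, each a list of category IDs (0 = empty).
    Each category present in the sketch is kept with probability `keep_prob`.
    """
    if rng is None:
        rng = random.Random()
    present = sorted({c for row in sketch for c in row if c != EMPTY})
    kept = {c for c in present if rng.random() < keep_prob}
    filtered = [[c if c in kept else EMPTY for c in row] for row in sketch]
    return filtered, kept


if __name__ == "__main__":
    full_sketch = [
        [1, 1, 0, 5],
        [1, 3, 4, 5],
        [0, 2, 4, 0],
    ]
    rng = random.Random(0)
    for _ in range(3):
        variant, kept = filter_categories(full_sketch, rng=rng)
        print([CATEGORIES[c] for c in sorted(kept)], variant)
```

Each call yields a sparser variant of the same sketch, so a single ground-truth image could be paired with many partial sketches during training.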