iControl3D: An Interactive System for Controllable 3D Scene Generation

Xingyi Li, Yizheng Wu, Jun Cen, Juewen Peng, Kewei Wang, Ke Xian, Zhe Wang, Zhiguo Cao, Guosheng Lin

TL;DR

The paper tackles the challenge of controllable, scalable 3D scene generation by weaving together a three-component system: a 3D creator interface for fine-grained user control, a generative RGB-D fusion pipeline that iteratively builds a cohesive 3D mesh from 2D diffusion outputs, and a neural rendering interface that supports online NeRF-based navigation and video rendering. It introduces boundary-aware depth alignment to smooth depth transitions at mesh boundaries and uses environment maps to model remote outdoor content, improving outdoor scene realism. By integrating ControlNet-style conditioning, the system enables scribbles, segmentation, and depth inputs to steer diffusion outputs toward user intent. Extensive experiments and a user study demonstrate superior quality and diversity compared with strong baselines, highlighting the practical potential of interactive, diffusion-guided 3D scene creation. The work advances accessible, high-fidelity 3D content creation with real-time controllability and view-consistent rendering.

Abstract

3D content creation has long been a complex and time-consuming process, often requiring specialized skills and resources. While recent advancements have allowed for text-guided 3D object and scene generation, they still fall short of providing sufficient control over the generation process, leading to a gap between the user's creative vision and the generated results. In this paper, we present iControl3D, a novel interactive system that empowers users to generate and render customizable 3D scenes with precise control. To this end, a 3D creator interface has been developed to provide users with fine-grained control over the creation process. Technically, we leverage 3D meshes as an intermediary proxy to iteratively merge individual 2D diffusion-generated images into a cohesive and unified 3D scene representation. To ensure seamless integration of 3D meshes, we propose to perform boundary-aware depth alignment before fusing the newly generated mesh with the existing one in 3D space. Additionally, to effectively manage depth discrepancies between remote content and foreground, we propose to model remote content separately with an environment map instead of 3D meshes. Finally, our neural rendering interface enables users to build a radiance field of their scene online and navigate the entire scene. Extensive experiments have been conducted to demonstrate the effectiveness of our system. The code will be made available at https://github.com/xingyi-li/iControl3D.
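
The abstract describes using 3D meshes as a proxy for merging generated RGB-D images. As a rough illustration of that general idea, the sketch below back-projects a depth map into mesh vertices and connects neighbouring pixels into triangles. It assumes a pinhole camera whose intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical parameters chosen for this example; the function name `depth_to_mesh` is likewise illustrative and is not taken from the authors' code.

```python
import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to vertices and triangle faces.

    Assumes a pinhole camera (fx, fy, cx, cy are hypothetical intrinsics)
    and depth measured along the camera z-axis.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    vertices = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Two triangles per 2x2 pixel block; indices refer to the flattened grid.
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate(
        [np.stack([tl, bl, tr], axis=1), np.stack([tr, bl, br], axis=1)],
        axis=0,
    )
    return vertices, faces
```

A practical pipeline would typically also drop triangles that span large depth discontinuities; that step is omitted here to keep the sketch short.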

Paper Structure (25 sections, 7 equations, 15 figures, 5 tables)

Figures (15)

  • Figure 1: Our system empowers users to generate and render customizable 3D scenes with precise control over the 3D scene generation process. With our system, users can actively participate in the 3D scene creation process. For example, they can (a) manipulate the virtual camera to any viewpoint, (b) adjust the size of the selection box to generate global and local content, and try different random seeds to generate various results. (c) Besides text prompts, users can achieve fine-grained control over the output by adding extra conditions such as scribbles, semantic segmentation maps, and depth. (d) After generating 3D scenes, they can navigate the entire scene and create camera trajectories to render videos according to their preferences.
  • Figure 2: System overview. (I) Within our 3D creator interface, users are allowed to manipulate the camera to any viewpoint, adjust the size of the selection box to generate local content, and try different random seeds to create a variety of results. Moreover, users can achieve fine-grained control over the generation process by adding extra conditions such as user scribbles; (II) Once the generated result in (I) is accepted by the users, our generative RGB-D fusion module fuses it with the existing mesh. This alternating process between (I) and (II) continues until a satisfactory 3D structure is obtained; (III) After generating 3D scenes, our neural rendering interface then builds a radiance field online and enables users to navigate the entire scene. By recording their virtual journey through the scene, users can also produce high-quality videos that showcase the intricacies and beauty of their designs.
  • Figure 3: Boundary-aware depth alignment. Directly combining the rendered depth $\hat{\bm{D}}_{t+1}$ and the predicted depth $\tilde{\bm{D}}_{t+1}$ leads to abrupt transitions in the combined depth while our boundary-aware depth alignment ensures a more seamless depth fusion.
  • Figure 4: Fine-grained control. Compared to (a) current text-driven methods [hollein2023text2room, fridman2023scenescape], (b) our system can achieve fine-grained control over the output by adding extra conditions such as scribbles, depth, and semantic segmentation maps.
  • Figure 5: Effectiveness of boundary-aware depth alignment. Without boundary-aware depth alignment, the generated mesh may exhibit abrupt transitions at the boundaries between the newly generated content and the existing mesh.
  • ...and 10 more figures
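
The Figure 3 caption contrasts directly combining the rendered depth $\hat{\bm{D}}_{t+1}$ with the predicted depth $\tilde{\bm{D}}_{t+1}$ against a boundary-aware alignment. The excerpt here does not give the paper's formulation, so the following is only a minimal sketch of one plausible variant: a weighted least-squares fit of a scale and shift that maps the predicted depth onto the rendered depth, with extra weight on known pixels adjacent to the region being filled. The names `boundary_mask` and `align_depth` and the parameter `boundary_weight` are assumptions made for this example.

```python
import numpy as np

def boundary_mask(known):
    """Known pixels with at least one 4-neighbour that is not known."""
    # Pad with True so the image border itself does not count as a boundary.
    padded = np.pad(known, 1, constant_values=True)
    all_neighbours_known = (
        padded[:-2, 1:-1] & padded[2:, 1:-1]
        & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return known & ~all_neighbours_known

def align_depth(rendered, predicted, known, boundary_weight=10.0):
    """Fit scale and shift so that scale * predicted + shift matches rendered.

    `known` marks pixels where the rendered depth is valid. Boundary pixels
    get a larger weight (an assumed heuristic, not the paper's method).
    """
    w = np.where(boundary_mask(known), boundary_weight, 1.0)[known]
    a = np.stack([predicted[known], np.ones(known.sum())], axis=1)
    sw = np.sqrt(w)  # weighted least squares via row scaling
    (scale, shift), *_ = np.linalg.lstsq(
        a * sw[:, None], rendered[known] * sw, rcond=None
    )
    aligned = scale * predicted + shift
    # Keep rendered depth where it exists; use aligned prediction elsewhere.
    return np.where(known, rendered, aligned), scale, shift
```

Weighting the boundary pixels more heavily pulls the fit toward agreement exactly where the two depth maps meet, which is where an abrupt transition would be most visible.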