iControl3D: An Interactive System for Controllable 3D Scene Generation
Xingyi Li, Yizheng Wu, Jun Cen, Juewen Peng, Kewei Wang, Ke Xian, Zhe Wang, Zhiguo Cao, Guosheng Lin
TL;DR
The paper addresses controllable 3D scene generation with a three-component system: a 3D creator interface for fine-grained user control, a generative RGB-D fusion pipeline that iteratively merges 2D diffusion outputs into a single cohesive 3D mesh, and a neural rendering interface that supports online NeRF-based navigation and video rendering. Boundary-aware depth alignment smooths depth transitions where a newly generated mesh meets the existing one, and an environment map models remote content separately from the foreground, which improves the realism of outdoor scenes. ControlNet-style conditioning lets users steer the diffusion outputs with scribbles, segmentation maps, and depth inputs. Experiments and a user study indicate higher quality and diversity than the compared baselines, supporting the practicality of interactive, diffusion-guided 3D scene creation with view-consistent rendering.
Abstract
3D content creation has long been a complex and time-consuming process, often requiring specialized skills and resources. While recent advancements have allowed for text-guided 3D object and scene generation, they still fall short of providing sufficient control over the generation process, leading to a gap between the user's creative vision and the generated results. In this paper, we present iControl3D, a novel interactive system that empowers users to generate and render customizable 3D scenes with precise control. To this end, a 3D creator interface has been developed to provide users with fine-grained control over the creation process. Technically, we leverage 3D meshes as an intermediary proxy to iteratively merge individual 2D diffusion-generated images into a cohesive and unified 3D scene representation. To ensure seamless integration of 3D meshes, we propose to perform boundary-aware depth alignment before fusing the newly generated mesh with the existing one in 3D space. Additionally, to effectively manage depth discrepancies between remote content and foreground, we propose to model remote content separately with an environment map instead of 3D meshes. Finally, our neural rendering interface enables users to build a radiance field of their scene online and navigate the entire scene. Extensive experiments have been conducted to demonstrate the effectiveness of our system. The code will be made available at https://github.com/xingyi-li/iControl3D.
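The abstract does not spell out how the depth alignment is computed. As a rough illustration of the general idea only, the sketch below fits a scale and shift that map a newly predicted depth map onto the depth already rendered from the existing mesh, using weighted least squares over the overlapping pixels. The function names, the flat-list data layout, and the use of per-pixel weights (for example, larger near the seam) are all assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence, Tuple


def fit_scale_shift(
    pred: Sequence[float],
    ref: Sequence[float],
    weights: Sequence[float],
) -> Tuple[float, float]:
    """Return (s, t) minimizing sum_i w_i * (s * pred_i + t - ref_i)^2.

    pred    : predicted depth at overlapping pixels (hypothetical input)
    ref     : depth rendered from the existing mesh at the same pixels
    weights : non-negative per-pixel weights; weighting pixels near the
              mesh boundary more heavily is an assumption of this sketch
    """
    if not (len(pred) == len(ref) == len(weights)):
        raise ValueError("pred, ref and weights must have the same length")

    # Weighted sums needed by the closed-form solution of the 2x2 normal equations.
    sw = sum(weights)
    sp = sum(w * p for w, p in zip(weights, pred))
    sr = sum(w * r for w, r in zip(weights, ref))
    spp = sum(w * p * p for w, p in zip(weights, pred))
    spr = sum(w * p * r for w, p, r in zip(weights, pred, ref))

    det = sw * spp - sp * sp
    if det == 0:
        raise ValueError("degenerate overlap: cannot fit scale and shift")

    s = (sw * spr - sp * sr) / det
    t = (sr - s * sp) / sw
    return s, t


def apply_scale_shift(depth: Sequence[float], s: float, t: float) -> list:
    """Apply the fitted affine correction to every depth value."""
    return [s * d + t for d in depth]


# Usage: the reference depth here is exactly 2 * pred + 0.5,
# so the fit should recover that scale and shift.
pred_overlap = [1.0, 2.0, 3.0, 4.0]
ref_overlap = [2.5, 4.5, 6.5, 8.5]
seam_weights = [1.0, 1.0, 2.0, 4.0]  # hypothetical: heavier near the seam
scale, shift = fit_scale_shift(pred_overlap, ref_overlap, seam_weights)
aligned = apply_scale_shift(pred_overlap, scale, shift)
```

In a real pipeline the depth maps would typically be image-shaped arrays and the overlap would come from a visibility mask; flat lists are used here only to keep the example free of dependencies.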
