Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting

Yian Wang, Xiaowen Qiu, Jiageng Liu, Zhehuan Chen, Jiting Cai, Yufei Wang, Tsun-Hsuan Wang, Zhou Xian, Chuang Gan

TL;DR

This work introduces Architect, a generative framework that creates complex and realistic 3D embodied environments using diffusion-based 2D image inpainting. It uses foundation visual perception models to extract each generated object from the image and pre-trained depth estimation models to lift the generated 2D image into 3D space.

Abstract

Creating large-scale interactive 3D environments is essential for the development of Robotics and Embodied AI research. Current methods, including manual design, procedural generation, diffusion-based scene generation, and large language model (LLM) guided scene design, are hindered by limitations such as excessive human effort, reliance on predefined rules or training datasets, and limited 3D spatial reasoning ability. Since pre-trained 2D image generative models better capture scene and object configuration than LLMs, we address these challenges by introducing Architect, a generative framework that creates complex and realistic 3D embodied environments leveraging diffusion-based 2D image inpainting. In detail, we utilize foundation visual perception models to obtain each generated object from the image and leverage pre-trained depth estimation models to lift the generated 2D image to 3D space. Our pipeline is further extended to a hierarchical and iterative inpainting process to continuously generate placement of large furniture and small objects to enrich the scene. This iterative structure brings the flexibility for our method to generate or refine scenes from various starting points, such as text, floor plans, or pre-arranged environments.
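The 2D-to-3D lifting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard pinhole camera with known intrinsics (fx, fy, cx, cy) and a depth map in metric units; the abstract does not state the camera model or the exact procedure, so the function names below (backproject_pixel, lift_mask_to_points, axis_aligned_bbox) are hypothetical and this is not the authors' implementation. Given an object mask from a perception model and a per-pixel depth estimate, the sketch back-projects the masked pixels into camera-space points and computes a bounding box that could guide object placement.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Back-project pixel (u, v) with the given depth into camera space.

    Assumes a pinhole camera: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_mask_to_points(depth_map: Sequence[Sequence[float]],
                        mask: Sequence[Sequence[int]],
                        fx: float, fy: float,
                        cx: float, cy: float) -> List[Point3D]:
    """Lift every masked pixel with positive depth to a 3D point."""
    points: List[Point3D] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth_map, mask)):
        for u, (d, m) in enumerate(zip(depth_row, mask_row)):
            if m and d > 0:
                points.append(backproject_pixel(u, v, d, fx, fy, cx, cy))
    return points


def axis_aligned_bbox(points: Sequence[Point3D]) -> Tuple[Point3D, Point3D]:
    """Return (min_corner, max_corner) of the axis-aligned bounding box."""
    if not points:
        raise ValueError("cannot compute a bounding box of zero points")
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


if __name__ == "__main__":
    # Toy 2x2 example: constant depth, two masked pixels (hypothetical data).
    depth_map = [[2.0, 2.0], [2.0, 2.0]]
    mask = [[1, 0], [0, 1]]
    pts = lift_mask_to_points(depth_map, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
    print(axis_aligned_bbox(pts))
```

In practice the depth map would come from a pre-trained monocular depth estimator and the mask from a segmentation model; the resulting box is expressed in the camera frame and would still need to be transformed into the scene's world frame.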

Paper Structure

This paper contains 34 sections, 2 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: We present Architect, a generative framework to create diverse, realistic, and complex Embodied AI scenes. Leveraging 2D diffusion models, Architect generates scenarios in an open-vocabulary manner. Here, we showcase two cases in detail: an apartment and a grocery store.
  • Figure 2: Demonstration of our pipeline, which generates complex interactive environments starting from empty scenes and consists of Initializing, Inpainting, Visual Perception, and Placing modules.
  • Figure 3: We compare Architect with other methods in both household scenes (living room and dining room) and non-household scenes. Because of its limitations, DiffuScene is compared only on household scenes, in Figure \ref{fig:5}; the comparison with Text2Room is shown in Figure \ref{fig:6}.
  • Figure 4: Two robot manipulation tasks generated in our scene setting.
  • Figure 5: Left: the robot organizes the room by pushing the chair under the table and pushing the keyboard inside the table. Right: the robot opens the fridge door, grasps the mango and puts it into the fridge, opens the kitchen-dining room door, grasps the soda can and puts it on the dining room table, and finally closes the fridge.
  • ...and 5 more figures