MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text

Takayuki Hara, Tatsuya Harada

TL;DR

The proposed method takes partial images, layout information represented in the top view, and text prompts as multimodal conditions, and uses them to control and generate 3D scenes in diverse domains, from indoor to outdoor.

Abstract

The generation of 3D scenes from user-specified conditions offers a promising avenue for alleviating the production burden in 3D applications. Previous studies required significant effort to realize the desired scene, owing to limited control conditions. We propose a method for controlling and generating 3D scenes under multimodal conditions using partial images, layout information represented in the top view, and text prompts. Combining these conditions to generate a 3D scene involves the following significant difficulties: (1) the creation of large datasets, (2) reflection on the interaction of multimodal conditions, and (3) domain dependence of the layout conditions. We decompose the process of 3D scene generation into 2D image generation from the given conditions and 3D scene generation from 2D images. 2D image generation is achieved by fine-tuning a pretrained text-to-image model with a small artificial dataset of partial images and layouts, and 3D scene generation is achieved by layout-conditioned depth estimation and neural radiance fields (NeRF), thereby avoiding the creation of large datasets. The use of a common representation of spatial information using 360-degree images allows for the consideration of multimodal condition interactions and reduces the domain dependence of the layout control. The experimental results qualitatively and quantitatively demonstrated that the proposed method can generate 3D scenes in diverse domains, from indoor to outdoor, according to multimodal conditions.

Paper Structure (37 sections, 20 equations, 23 figures, 6 tables)

Figures (23)

  • Figure 1: From a given partial image, layout information represented in the top view, and text prompts, our method generates a 3D scene represented by 360-degree RGB-D and NeRF. Free-perspective views can be rendered from the NeRF model.
  • Figure 2: Overview of the proposed method to generate 360-degree RGB-D and NeRF models from a partial image, layouts, and text prompts. (a) The partial image is converted to an ERP image from the observer position with the specified direction and field-of-view (FoV). The layout represented in the top view is converted to a coarse depth and a semantic map in ERP format with the observer position as the projection center. (b) These ERP images and texts are combined to generate a 360-degree RGB. (c) The generated RGB is combined with the coarse depth to estimate the fine depth. (d) A NeRF model is trained from 360-degree RGB-D.
  • Figure 3: The case of using a terrain map for the layout format. The partial image and the terrain map are converted into ERP images from the observer's viewpoint, respectively.
  • Figure 4: The pipeline of generating 360-degree RGB from a partial image, coarse depth map, semantic map, and text prompts.
  • Figure 5: Semantic map. In the proposed method, regions related to objects are extracted and enclosed in bounding boxes to form the semantic map. Regions derived from the shape of the room, such as walls, floor, and ceiling, are excluded.
  • ...and 18 more figures
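
The caption of Figure 2(a) describes converting a top-view layout into a coarse depth map in ERP (equirectangular projection) format, with the observer position as the projection center. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the simplest possible layout, a single axis-aligned box room, and an axis convention chosen here for illustration (x forward, y left, z up). The function names `erp_directions` and `box_room_coarse_depth` are hypothetical.

```python
import numpy as np


def erp_directions(height, width):
    """Unit ray direction for every pixel of an ERP image.

    Assumed convention: columns map to longitude (+pi at the left edge,
    -pi at the right edge), rows map to latitude (+pi/2 at the top,
    -pi/2 at the bottom); x is forward, y is left, z is up.
    """
    # Angles are sampled at pixel centers.
    lon = (0.5 - (np.arange(width) + 0.5) / width) * 2.0 * np.pi
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    lon, lat = np.meshgrid(lon, lat)
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )


def box_room_coarse_depth(dirs, observer, room_min, room_max):
    """Distance from the observer to the box boundary along each ray.

    `observer` must lie strictly inside the box [room_min, room_max].
    Returns an array of shape dirs.shape[:-1].
    """
    observer = np.asarray(observer, dtype=float)
    room_min = np.asarray(room_min, dtype=float)
    room_max = np.asarray(room_max, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        t_to_max = (room_max - observer) / dirs
        t_to_min = (room_min - observer) / dirs
    # Per axis, a ray can only exit through the face it is heading toward.
    t = np.where(dirs > 0, t_to_max, t_to_min)
    # A ray parallel to an axis never reaches that axis's faces.
    t = np.where(dirs == 0, np.inf, t)
    # The first face reached is where the ray leaves the room.
    return t.min(axis=-1)
```

For a 4 m x 4 m x 3 m room with the observer at its center, `box_room_coarse_depth(erp_directions(64, 128), (0, 0, 1.5), (-2, -2, 0), (2, 2, 3))` yields a 64 x 128 map of ray lengths in metres. A real layout with objects or terrain, as in Figure 3, would need a more general ray-casting step than this box intersection.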