Context-Aware Indoor Point Cloud Object Generation through User Instructions

Yiyang Luo, Ke Lin, Chao Gu

TL;DR

This work tackles the problem of modifying indoor 3D scenes by generating contextually integrated point-cloud objects driven by natural language. It introduces an end-to-end multi-modal framework built on a GPT-aided data pipeline and a diffusion-based generator derived from Point-E, enhanced with cross-modal fusion, quantized position prediction, and CLIP-guided context alignment. The Nr3D-SA and Sr3D-SA datasets, created by paraphrasing ReferIt3D descriptions, enable rich instruction-conditioned generation and evaluation, including visual grounding metrics. Across comprehensive experiments, the approach demonstrates realistic object generation, diverse outputs, and effective integration with surrounding geometry, with ablations confirming the value of each component. The method holds promise for AR/VR scene editing and data augmentation in downstream tasks such as visual grounding and immersive environment creation.

Abstract

Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-modal deep neural network capable of generating point cloud objects seamlessly integrated with their surroundings, driven by textual instructions. Our model revolutionizes scene modification by enabling the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce innovative techniques such as quantized position prediction and Top-K estimation to address the issue of false negatives resulting from ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations to showcase the diversity of generated objects, the efficacy of textual instructions, and the quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. To provide a holistic assessment, we incorporate visual grounding as an additional metric, ensuring the quality and coherence of the scenes produced by our model. Through these advancements, our approach not only advances the state-of-the-art in indoor scene modification but also lays the foundation for future innovations in immersive computing and digital environment creation.
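The abstract names quantized position prediction and Top-K estimation without giving details here, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes one room axis split into equal-width bins, a hypothetical list of per-bin scores standing in for model output, and hypothetical helper names (`quantize`, `bin_center`, `top_k_bins`, `top_k_hit`). It illustrates why scoring against the Top-K bins rather than only the best bin can avoid counting a plausible placement as a false negative when the instruction is ambiguous.

```python
from typing import List


def quantize(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Map a continuous coordinate in [lo, hi] to a bin index in [0, n_bins - 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    frac = (value - lo) / (hi - lo)
    idx = int(frac * n_bins)
    # Clamp so that value == hi (or out-of-range input) still maps to a valid bin.
    return min(max(idx, 0), n_bins - 1)


def bin_center(idx: int, lo: float, hi: float, n_bins: int) -> float:
    """Recover a continuous coordinate (the bin center) from a bin index."""
    width = (hi - lo) / n_bins
    return lo + (idx + 0.5) * width


def top_k_bins(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring bins, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_hit(scores: List[float], target_bin: int, k: int) -> bool:
    """True if the annotated bin is among the k best-scoring bins."""
    return target_bin in top_k_bins(scores, k)


# Hypothetical example: a 4 m room axis split into 8 bins of 0.5 m each.
LO, HI, N_BINS = 0.0, 4.0, 8
target = quantize(1.3, LO, HI, N_BINS)  # annotated object position at 1.3 m

# Made-up scores with two plausible placements (bins 5 and 2), as might
# happen when an instruction such as "next to the table" fits two spots.
scores = [0.10, 0.05, 0.30, 0.02, 0.03, 0.35, 0.10, 0.05]

candidates = top_k_bins(scores, 2)
candidate_positions = [bin_center(i, LO, HI, N_BINS) for i in candidates]
```

With these made-up scores, the annotated bin is not the single best bin but is among the two best, so a Top-1 criterion would reject a placement that a Top-2 criterion accepts. Extending the sketch to three axes would amount to applying the same quantization per axis or over a joint grid; which of these the paper uses is not stated in this summary.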

Paper Structure (32 sections, 9 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: In response to the query, our model generates a couch positioned close to the television and keeps it consistent with the rest of the scene in orientation, size, and, in certain cases, overlap with other objects.
  • Figure 2: Overview of our method. (a) A large language model (LLM) is used to paraphrase the descriptive text, combined with rule-based and manual corrections. (b) Upon receiving generative text as a query and a point cloud as input, our model integrates both object and language features to predict the final position. In addition, the language features are aligned across the model. The fused features are then processed through the Point-E model to generate a realistic object.
  • Figure 3: Extraction of context vector $\bm{z}_{ctx}$.
  • Figure 4: Scenes before and after modification. Each row represents the scenes to be modified under different instructions. Different random seeds are used to generate the columns of the modified scene. Candidate locations are extracted from the Top-5 predictions. The bounding boxes of reference objects and generated objects are outlined in blue and red, respectively.
  • Figure 5: Diversity. The leftmost column shows the category of the object to be generated from the instruction. Each row shows different generations under the same instruction.
  • ...and 1 more figure