Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

Chengyuan Xu, Radha Kumaran, Noah Stier, Kangyou Yu, Tobias Höllerer

TL;DR

The paper presents a fast multimodal 3D reconstruction pipeline that fuses language-aware CLIP features with geometric TSDF representations to produce semantically and linguistically informed 3D scenes for AR. By introducing in-situ learning, the system enables user-guided, on-device refinement of object identities and behaviors, using a graph-based representation of objects and dynamic graph CNNs to track changes across scans. Two AR demos on Magic Leap 2 showcase spatial search via natural language and an intelligent object inventory that highlights unchanged or missing items over time, supported by a volumetric diff visualization. The approach advances open-vocabulary 3D perception and context-aware AR interfaces, and the code and data are open source to support further research in spatially aware AI and real-time AR interaction.

Abstract

Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel capabilities that leverage the semantics in the 3D environment for various object-level interactions. Meanwhile, the computer vision community has made leaps in neural vision-language understanding to enhance environment perception for autonomous tasks. In this work, we introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation, enabling user-guided machine learning involving physical objects. We first present a fast multimodal 3D reconstruction pipeline that brings linguistic understanding to AR by fusing CLIP vision-language features into the environment and object models. We then propose "in-situ" machine learning, which, in conjunction with the multimodal representation, enables new tools and interfaces for users to interact with physical spaces and objects in a spatially and linguistically meaningful manner. We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time. We also make our full implementation and demo data available at (https://github.com/cy-xu/spatially_aware_AI) to encourage further exploration and research in spatially aware AI.
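
As an illustration of the spatial search application, the sketch below ranks voxels by cosine similarity between their fused features and a text-query embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name spatial_search, the array layout, and the 16-dimensional random vectors standing in for CLIP embeddings are all hypothetical. A real system would obtain the query vector from a CLIP text encoder and the per-voxel vectors from the fusion pipeline.

```python
import numpy as np

def spatial_search(voxel_features, voxel_centers, query_embedding, top_k=5):
    """Rank voxels by cosine similarity to a query embedding.

    voxel_features:  (N, D) per-voxel feature vectors (stand-ins for fused CLIP features)
    voxel_centers:   (N, 3) voxel positions in world coordinates
    query_embedding: (D,)   embedding of the text query (assumed to share the feature space)
    """
    # Normalize both sides so the dot product equals cosine similarity.
    feats = voxel_features / (np.linalg.norm(voxel_features, axis=1, keepdims=True) + 1e-8)
    query = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
    scores = feats @ query
    # Indices of the top_k highest-scoring voxels, best first.
    order = np.argsort(-scores)[:top_k]
    return voxel_centers[order], scores[order]

# Toy data: random vectors replace real CLIP features (assumption for illustration).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))
centers = rng.uniform(0.0, 3.0, size=(100, 3))
# A query that lies very close to voxel 42's feature vector.
query = features[42] + 0.01 * rng.normal(size=16)

hits, hit_scores = spatial_search(features, centers, query, top_k=3)
```

The returned positions could then be highlighted in the headset view; thresholding the scores instead of taking a fixed top_k is an equally plausible design choice.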

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: System overview.
  • Figure 2: During the real-time in-situ training, we sample a sparse graph from an object's voxel representation stochastically, with the voxel location's CLIP feature as the node attribute. This design choice converts the challenging irregular 3D object classification problem into a simpler graph classification problem, which enables us to identify physical objects across multiple scans of the space. The sofa graph above is oversimplified for visualization purposes.
  • Figure 3: Two prototype applications developed on the Magic Leap 2 AR headset, demonstrating the potential of the proposed multimodal 3D fusion pipeline and "in-situ" machine learning for real-world scenarios.
  • Figure 4: A flowchart describing how the Scene Manager and the user can build an intelligent object inventory with the in-situ learning model. After multimodal 3D fusion and post-processing, individual objects are passed through the in-situ model to re-identify previously existing objects and eventually reveal missing objects.
  • Figure 5: From object names to abstract natural-language queries, we show more examples of spatial search in a real-world environment.
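
The stochastic graph sampling described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's code: the function name sample_object_graph, the node count, and the use of spatial k-nearest-neighbor edges are assumptions made here (the caption only states that a sparse graph is sampled stochastically, with each voxel's CLIP feature as the node attribute). Random vectors stand in for real CLIP features.

```python
import numpy as np

def sample_object_graph(voxel_xyz, voxel_feats, num_nodes=32, k=4, rng=None):
    """Sample a sparse graph from an object's voxels.

    voxel_xyz:   (N, 3) voxel positions belonging to one object
    voxel_feats: (N, D) per-voxel features used as node attributes
    Returns node features (n, D), node positions (n, 3), and edge_index (2, n*k).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = min(num_nodes, len(voxel_xyz))
    # Stochastic step: pick a random subset of voxels without replacement.
    idx = rng.choice(len(voxel_xyz), size=n, replace=False)
    pos, x = voxel_xyz[idx], voxel_feats[idx]

    # Assumed edge rule: connect each node to its k nearest sampled neighbors.
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude self-loops
    k = min(k, n - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    src = np.repeat(np.arange(n), k)
    dst = neighbors.reshape(-1)
    edge_index = np.stack([src, dst])
    return x, pos, edge_index

# Toy object: 500 voxels with 16-dimensional placeholder features.
rng = np.random.default_rng(1)
xyz = rng.uniform(size=(500, 3))
feats = rng.normal(size=(500, 16))
x, pos, edge_index = sample_object_graph(xyz, feats, num_nodes=32, k=4, rng=rng)
```

Because a fresh subset is drawn on every call, repeated sampling of the same object yields many different graphs, which is one plausible way such sampling could act as data augmentation during training. A dynamic graph CNN would typically recompute neighborhoods in feature space at each layer, so the fixed spatial edges here serve only as an initial structure.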