MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan

TL;DR

MultiPLY addresses the gap in current multimodal LLMs by enabling active, multisensory interaction in 3D environments. It introduces the Multisensory Universe dataset and an object-centric, token-based interaction framework that couples an embodied agent with a pre-trained LLM through action and state tokens. Across object retrieval, tool use, multisensory captioning, and task decomposition, MultiPLY outperforms baselines by large margins, demonstrating the value of active data collection and modality-specific adapters. This work advances embodied AI in 3D worlds by integrating visual, audio, tactile, and thermal cues into language-driven reasoning, with scalable data collection and a practical training paradigm.

Abstract

Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that can incorporate multisensory interactive data, including visual, audio, tactile, and thermal information, into large language models, thereby establishing the correlation among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data points, by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with a pre-trained LLM on the generated data, we first encode the 3D scene as abstracted object-centric representations and then introduce action tokens, which denote that the embodied agent takes certain actions within the environment, as well as state tokens, which represent the agent's multisensory state observations at each time step. At inference time, MultiPLY can generate action tokens, instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin on a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.
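
The inference procedure described in the abstract (generate an action token, execute it in the environment, append the resulting observation as state tokens, then continue generating) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the token names `<TOUCH>`, `<HIT>`, and `<STATE>`, the `scripted_model` stand-in for the LLM, and the `toy_env` stand-in for the 3D simulator are hypothetical and are not taken from the paper's implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical action-token vocabulary (placeholder names, not from the paper).
ACTION_TOKENS = {"<TOUCH>", "<HIT>"}


def toy_env(action: str, target: str) -> str:
    """Stand-in for the 3D environment: returns a sensory reading for an action."""
    readings = {
        ("<TOUCH>", "mug"): "temperature=hot",
        ("<HIT>", "mug"): "sound=ceramic",
    }
    return readings.get((action, target), "no reading")


def scripted_model(context: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: picks the next action or emits a final answer.

    A real model would decode tokens; this script only checks which
    observations are already present in the context.
    """
    joined = " ".join(context)
    if "temperature=" not in joined:
        return "<TOUCH>", "mug"
    if "sound=" not in joined:
        return "<HIT>", "mug"
    return "<TEXT>", "The mug is hot and ceramic."


def run_episode(
    model: Callable[[List[str]], Tuple[str, str]],
    env: Callable[[str, str], str],
    prompt: str,
    max_steps: int = 8,
) -> List[str]:
    """Alternate between model generation and environment interaction."""
    context = [prompt]
    for _ in range(max_steps):
        token, argument = model(context)
        if token in ACTION_TOKENS:
            # Action token: the agent acts, and the observation is appended
            # back to the context wrapped in (hypothetical) state tokens.
            observation = env(token, argument)
            context.append(f"{token} {argument}")
            context.append(f"<STATE> {observation} </STATE>")
        else:
            # Plain text: treat it as the final answer and stop.
            context.append(argument)
            break
    return context


for line in run_episode(scripted_model, toy_env, "Describe the mug."):
    print(line)
```

The point of the sketch is the control flow: sensory details enter the context only after the model chooses to interact with an object, which mirrors the active (rather than passive) data collection the abstract emphasizes.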

Paper Structure (30 sections, 11 figures, 5 tables)

Figures (11)

  • Figure 1: We propose MultiPLY, a multisensory embodied LLM that encodes object-centric multisensory representations (e.g., visual, audio, tactile, and thermal), by deploying an embodied agent to engage with the 3D environment. MultiPLY excels at multiple tasks including multisensory captioning, question answering, dialogue, manipulation, navigation, tool use, task decomposition, and so on.
  • Figure 2: Multisensory-Universe Generation Pipelines. We first add a set of new interactive objects in the embodied environments, then prompt ChatGPT to generate diverse tasks about the environment. An embodied agent interacts with the objects to retrieve the multisensory information and construct interaction data.
  • Figure 3: Overview of our MultiPLY. We first encode the scene as an abstracted object-centric representation, while multisensory details of objects can only be unveiled when the agent executes an action and interacts with them. We devise a set of action tokens denoting the actions of agents to interact with the environment. The interaction results are appended back to the LLM via state tokens.
  • Figure 4: Qualitative examples of MultiPLY. MultiPLY can interact with the objects in the embodied environments and gather multisensory information.
  • Figure 5: Prompts for adding objects to the scene
  • ...and 6 more figures