Crafting Dynamic Virtual Activities with Advanced Multimodal Models

Changyang Li, Qingan Yan, Minyoung Kim, Zhan Li, Yi Xu, Lap-Fai Yu

TL;DR

  • Problem: Creating context-aware, dynamic multi-character activities in 3D virtual scenes is challenging because it requires interpreting complex environments and coordinating interactions.
  • Approach: The paper leverages vision-language multimodal inputs and a Layout Chain-of-Thought (L-CoT) prompting strategy to reason about scenes, generate high-level activity descriptions, and ground them in 3D via an MCMC-based pose optimizer.
  • Contributions: A structured activity representation, explicit spatial-reasoning prompts, and an efficient grounding pipeline validated on apartment, restaurant, and office scenes, plus AR scans with perceptual validation.
  • Findings: L-CoT and visual inputs improve reasoning and grounding, yielding scalable, realistic multi-character behaviors with efficiency gains over manual design.
  • Significance: Enables richer VR/AR experiences, training and storytelling workflows, and scalable content creation in future metaverse-like spaces.
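The TL;DR names an MCMC-based pose optimizer that grounds the generated activity descriptions in 3D. As a rough, hypothetical sketch only, the snippet below shows what a Metropolis-Hastings style loop over a character's planar position and facing angle could look like; the cost terms, proposal scales, and function names (`placement_cost`, `propose`) are assumptions for illustration, not the paper's implementation.

```python
import math
import random

# Hypothetical placement state: 2D position (x, y) plus facing angle theta for one
# character. The cost terms and weights are illustrative assumptions, not the paper's.

def placement_cost(state, anchor, partner):
    """Lower is better: stay close to the anchor object and face the interaction partner."""
    x, y, theta = state
    ax, ay = anchor
    px, py = partner
    dist_term = math.hypot(x - ax, y - ay)          # proximity to the anchoring object
    desired = math.atan2(py - y, px - x)            # direction toward the partner
    facing_term = 1.0 - math.cos(theta - desired)   # 0 when facing the partner directly
    return dist_term + facing_term

def propose(state, pos_step=0.2, ang_step=0.3):
    """Gaussian random-walk proposal over position and orientation."""
    x, y, theta = state
    return (x + random.gauss(0, pos_step),
            y + random.gauss(0, pos_step),
            theta + random.gauss(0, ang_step))

def optimize_pose(init, anchor, partner, iters=2000, temperature=0.1):
    """Metropolis-Hastings style acceptance over the placement cost."""
    state, cost = init, placement_cost(init, anchor, partner)
    best_state, best_cost = state, cost
    for _ in range(iters):
        cand = propose(state)
        cand_cost = placement_cost(cand, anchor, partner)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cand_cost < cost or random.random() < math.exp((cost - cand_cost) / temperature):
            state, cost = cand, cand_cost
            if cost < best_cost:
                best_state, best_cost = state, cost
    return best_state, best_cost

if __name__ == "__main__":
    pose, cost = optimize_pose(init=(0.0, 0.0, 0.0), anchor=(2.0, 1.0), partner=(3.0, 3.0))
    print(f"placed at ({pose[0]:.2f}, {pose[1]:.2f}), heading {pose[2]:.2f} rad, cost {cost:.3f}")
```

In the paper's setting, the cost would presumably also account for collisions, supporting surfaces, and the pose type (standing, sitting, lying) described in Figure 4; the sketch keeps only a distance and a facing term to show the sampling loop itself.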

Abstract

In this paper, we investigate the use of multimodal large language models (MLLMs) for generating virtual activities, leveraging the integration of vision-language modalities to enable the interpretation of virtual environments. Our approach recognizes and abstracts key scene elements, including scene layouts, semantic contexts, and object identities, using MLLMs' multimodal reasoning capabilities. By correlating these abstractions with extensive knowledge of human activities, MLLMs can generate adaptive and contextually relevant virtual activities. We propose a structured framework for articulating abstract activity descriptions, emphasizing detailed multi-character interactions within virtual spaces. Utilizing the derived high-level contexts, our approach accurately positions virtual characters and, through strategic optimization, ensures that their interactions and behaviors are realistic and contextually appropriate. Experimental results demonstrate the effectiveness of our approach, offering a novel direction for enhancing realism and context-awareness in simulated virtual environments.
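The abstract refers to a structured framework for articulating activity descriptions with multi-character interactions. A minimal, hypothetical sketch of such a keyframe-based representation is shown below; the class and field names (`CharacterState`, `Keyframe`, `reference_object`, etc.) are assumptions made for illustration, loosely following the elements named in Figure 1 (poses, positional references, and interactions), not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical keyframe-based activity representation; class and field names are
# illustrative assumptions, not the paper's schema.

@dataclass
class CharacterState:
    name: str
    pose: str                               # e.g. "sitting", "standing", "lying"
    reference_object: str                   # scene object the position is anchored to
    interacting_with: Optional[str] = None  # another character or object, if any

@dataclass
class Keyframe:
    time_step: int
    description: str                        # high-level activity description from the MLLM
    characters: List[CharacterState] = field(default_factory=list)

@dataclass
class Activity:
    scene: str
    keyframes: List[Keyframe] = field(default_factory=list)

# Example instance for one keyframe in a restaurant scene.
example = Activity(
    scene="restaurant",
    keyframes=[
        Keyframe(
            time_step=0,
            description="A waiter takes an order from two seated guests.",
            characters=[
                CharacterState("guest_1", "sitting", "dining_chair_03", "waiter"),
                CharacterState("guest_2", "sitting", "dining_chair_04", "waiter"),
                CharacterState("waiter", "standing", "dining_table_01", "guest_1"),
            ],
        )
    ],
)
```

A representation along these lines would give the downstream pose optimizer concrete anchors (reference objects, interaction partners) to ground each character at each keyframe.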

Paper Structure

This paper contains 17 sections, 2 equations, 11 figures, and 1 table.

Figures (11)

  • Figure 1: An overview of our approach. We first capture multi-view Set-of-Mark (SoM) observations [yang2023set] of the scene, enabling the MLLM to construct an area-based scene graph and interpret spatial layouts. Next, the MLLM generates detailed descriptions of virtual activities at discrete keyframes, specifying characters’ poses, positional references, and interactions. Finally, a 3D pose optimizer positions the virtual characters in the scene. The example shows generated activity descriptions for a selected keyframe ($\mathbf{s}_t$) and their corresponding 3D realization.
  • Figure 2: An illustration of maintaining connectivity during the selection of multi-view observations. The initial coverage set includes views $x$ and $z$ covering distinct scene objects. A candidate view $y$ is selected to connect $x$ and $z$ by capturing shared objects (a code sketch of this selection step follows the figure list).
  • Figure 3: The proximity between areas captured by two camera views is determined following the Layout Chain-of-Thought (L-CoT) prompting, given their shared common objects.
  • Figure 4: The free spaces considered for different character poses during optimization: (a) standing near the furniture, (b) sitting on the predicted sitting points, and (c) lying on a supporting surface.
  • Figure 5: An example of an MLLM prompting format used in our method. The prompt explicitly guides the model to reason through character activities, spatial transitions, and interactions within the environment.
  • ...and 6 more figures
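
Figures 2 and 3 describe selecting multi-view observations whose coverage stays connected through shared objects, and reasoning about the proximity of the captured areas from those shared objects. The sketch below is a hypothetical greedy reading of that idea; the selection criterion and the adjacency construction are assumptions for illustration, not the paper's exact algorithm. Views are chosen to cover all labeled objects while preferring candidates that overlap views already picked, and an area adjacency edge is recorded whenever two chosen views share an object.

```python
# Hypothetical greedy view selection (cf. Figure 2) plus shared-object adjacency
# between the covered areas (cf. Figure 3). Illustrative only, not the paper's algorithm.
from itertools import combinations

def select_connected_views(view_objects, all_objects):
    """view_objects: dict mapping view id -> set of observed object ids."""
    chosen, covered = [], set()
    remaining = set(view_objects)
    while not all_objects <= covered and remaining:
        # Prefer candidates that overlap already-covered objects so coverage stays
        # connected (view y in Figure 2 is kept because it bridges views x and z).
        connected = [v for v in remaining if not chosen or view_objects[v] & covered]
        pool = connected or list(remaining)
        # Greedy step: among those, take the view adding the most uncovered objects.
        best = max(pool, key=lambda v: len(view_objects[v] - covered))
        chosen.append(best)
        covered |= view_objects[best]
        remaining.remove(best)
    return chosen

def area_adjacency(view_objects, chosen):
    """Two areas are treated as adjacent when their views share at least one object."""
    return {(a, b) for a, b in combinations(chosen, 2) if view_objects[a] & view_objects[b]}

if __name__ == "__main__":
    views = {
        "x": {"sofa", "tv"},
        "y": {"tv", "dining_table"},   # bridges x and z through shared objects
        "z": {"dining_table", "chair"},
    }
    picked = select_connected_views(views, {"sofa", "tv", "dining_table", "chair"})
    print(picked, area_adjacency(views, picked))
```

The resulting adjacency set plays the role of the area-based proximity that the L-CoT prompting reasons over in Figure 3: two areas are considered close when their views observe common objects.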