Crafting Dynamic Virtual Activities with Advanced Multimodal Models
Changyang Li, Qingan Yan, Minyoung Kim, Zhan Li, Yi Xu, Lap-Fai Yu
TL;DR
Problem: creating context-aware, dynamic multi-character activities in 3D virtual scenes is challenging because it requires interpreting complex environments and coordinating interactions.
Approach: the paper leverages vision-language multimodal inputs and a Layout Chain-of-Thought (L-CoT) prompting strategy to reason about scenes, generate high-level activity descriptions, and ground them in 3D via an MCMC-based pose optimizer.
Contributions: a structured activity representation, explicit spatial-reasoning prompts, and an efficient grounding pipeline validated on apartment, restaurant, and office scenes, plus AR scans with perceptual validation.
Findings: L-CoT and visual inputs improve reasoning and grounding, yielding scalable, realistic multi-character behaviors with efficiency gains over manual design.
Significance: enables richer VR/AR experiences, training and storytelling workflows, and scalable content creation in future metaverse-like spaces.
Abstract
In this paper, we investigate the use of multimodal large language models (MLLMs) for generating virtual activities, leveraging the integration of vision-language modalities to interpret virtual environments. Our approach recognizes and abstracts key scene elements, including scene layouts, semantic contexts, and object identities, using the multimodal reasoning capabilities of MLLMs. By correlating these abstractions with their extensive knowledge of human activities, MLLMs can generate adaptive and contextually relevant virtual activities. We propose a structured framework for articulating abstract activity descriptions, emphasizing detailed multi-character interactions within virtual spaces. Using the derived high-level contexts, our approach accurately positions virtual characters and, through strategic optimization, ensures that their interactions and behaviors are realistic and contextually appropriate. Experimental results demonstrate the effectiveness of our approach, providing a novel direction for enhancing realism and context-awareness in simulated virtual environments.
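The grounding step mentioned above (positioning characters via MCMC-based optimization) can be illustrated with a minimal sketch. Everything below is a hypothetical toy, not the paper's implementation: the `score` function stands in for the paper's full cost terms (interaction, collision, visibility, etc.), and a single 2D position stands in for a full character pose. A Metropolis-Hastings sampler then proposes perturbed placements and accepts them with probability proportional to the score ratio:

```python
import math
import random

def score(pose, anchor=(2.0, 3.0)):
    """Toy context score: higher when the character stands near a target
    object (e.g., a chair at `anchor`). A stand-in for the real cost,
    which would combine interaction, collision, and layout terms."""
    x, y = pose
    return math.exp(-((x - anchor[0]) ** 2 + (y - anchor[1]) ** 2))

def mcmc_place(iters=5000, step=0.5, seed=0):
    """Metropolis-Hastings over a 2D character position in a 5x5 room."""
    rng = random.Random(seed)
    pose = (rng.uniform(0.0, 5.0), rng.uniform(0.0, 5.0))
    best, best_s = pose, score(pose)
    for _ in range(iters):
        # Gaussian proposal around the current placement.
        cand = (pose[0] + rng.gauss(0.0, step), pose[1] + rng.gauss(0.0, step))
        # Accept with probability min(1, score(cand) / score(pose)).
        if rng.random() < min(1.0, score(cand) / max(score(pose), 1e-12)):
            pose = cand
            s = score(pose)
            if s > best_s:
                best, best_s = pose, s
    return best

best_pose = mcmc_place()
```

In a real pipeline the proposal would perturb position, orientation, and interaction targets jointly, and the score would be evaluated against the 3D scene rather than a fixed anchor.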
