Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment
Ghazanfar Ali, Hong-Quan Le, Junho Kim, Seoung-won Hwang, Jae-In Hwang
TL;DR
The paper addresses the problem of delivering engaging, non-repetitive content through intelligent virtual agents in wearable mixed reality for venues such as museums and botanical gardens. It presents a modular framework that combines spatial mapping, gaze-based interaction, speech, object recognition, cloud-based chatbot services, and expressive avatar animation to create seamless experiences on resource-constrained devices. Key contributions include the design and implementation of the virtual agent framework, an explicit mapping of speech content to body animations and facial emotions, and a scalable anchor-map approach for extending the MR workspace. A demonstration in a botanical garden scenario shows interactive response times of around 2–4 seconds after user queries (5–8 seconds total including perception and network latency), which supports the practicality of the approach and its adaptability to diverse MR devices. The work suggests that cloud-enabled multimodal virtual agents can enhance real-world MR experiences by combining realism, responsiveness, and flexibility across applications.
Abstract
In this paper, we present the design of a multimodal interaction framework for intelligent virtual agents in wearable mixed reality environments, aimed especially at interactive applications in museums, botanical gardens, and similar venues. Such venues need engaging, non-repetitive digital content delivery to maximize user involvement, and an intelligent virtual agent is a promising way to achieve both. The framework is premised on wearable mixed reality provided by MR devices that support spatial mapping. We envisioned a seamless interaction framework that integrates spatial mapping, virtual character animation, speech recognition, gaze, a domain-specific chatbot, and object recognition to enhance the virtual experience and the communication between users and virtual agents. By taking a modular approach and deploying computationally intensive modules on a cloud platform, we achieved a seamless virtual experience on a device with limited resources. Human-like gaze and speech interaction made the virtual agent more interactive, and automated mapping of body animations to speech content made it more engaging. In our tests, the virtual agents responded within 2-4 seconds of the user query. The strengths of the framework are its flexibility and adaptability: it can be adapted to any wearable MR device that supports spatial mapping.
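To make the idea of mapping speech content to body animations and facial emotions more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not describe the mapping scheme, so the keyword rules, the animation and emotion labels, and the names `AgentCue` and `map_speech_to_cue` are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword rules (assumption): the paper does not specify
# how speech content is mapped to animations or emotions.
ANIMATION_RULES = [
    ("greet", ("hello", "hi", "welcome")),
    ("point", ("here", "there", "this", "that", "look")),
    ("explain", ("because", "means", "called")),
]
EMOTION_RULES = [
    ("happy", ("welcome", "great", "beautiful", "glad")),
    ("sad", ("sorry", "unfortunately")),
]


@dataclass
class AgentCue:
    """Animation and facial emotion to play alongside a spoken reply."""
    animation: str
    emotion: str


def _first_match(words, rules, default):
    # Return the label of the first rule with a keyword in the reply.
    for label, keywords in rules:
        if any(keyword in words for keyword in keywords):
            return label
    return default


def map_speech_to_cue(reply: str) -> AgentCue:
    """Pick a body animation and a facial emotion for a chatbot reply."""
    words = set(re.findall(r"[a-z']+", reply.lower()))
    return AgentCue(
        animation=_first_match(words, ANIMATION_RULES, "idle_talk"),
        emotion=_first_match(words, EMOTION_RULES, "neutral"),
    )
```

In this sketch, a reply such as "Welcome to the garden!" would select the hypothetical `greet` animation with a `happy` expression, while a reply with no matching keyword falls back to `idle_talk` and `neutral`. A real system could replace the keyword lookup with a classifier or with tags returned by the chatbot service.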
