EmBARDiment: an Embodied AI Agent for Productivity in XR
Riccardo Bovo, Steven Abreu, Karan Ahuja, Eric J Gonzalez, Li-Te Cheng, Mar Gonzalez-Franco
TL;DR
EmBARDiment tackles the inefficiency of explicit prompts in XR AI agents by introducing a gaze-driven, memory-augmented attention framework that implicitly derives user context from eye-gaze and episodic memory. The system embeds an embodied AI agent in a multi-window XR environment, combining speech, gaze, OCR-based text extraction, and a 250-word contextual memory to ground responses generated by a large language model. A within-subject user study compares Baseline, Full Context, and Eye-Tracking conditions, showing reduced question reformulations and higher satisfaction/helpfulness for Eye-Tracking, while Full Context can degrade accuracy due to overabundant context. The work provides concrete design considerations for multimodal XR agents, highlighting the benefits of implicit attention, contextual memory, and context-following embodiment in improving productivity and natural interaction with AI in XR.
Abstract
XR devices running chat-bots powered by Large Language Models (LLMs) have the potential to become always-on agents that enable much better productivity scenarios. Current screen-based chat-bots do not take advantage of the full suite of natural inputs available in XR, including inward-facing sensor data; instead, they over-rely on explicit voice or text prompts, sometimes paired with multi-modal data dropped as part of the query. We propose a solution that leverages an attention framework that derives context implicitly from user actions, eye-gaze, and contextual memory within the XR environment. Our work minimizes the need for engineered explicit prompts, fostering grounded and intuitive interactions that glean user insights for the chat-bot.
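To make the idea of a gaze-driven contextual memory more concrete, the following is a minimal Python sketch of how text the user has looked at might be kept in a bounded buffer and prepended to a spoken query. It is an illustration only, not the authors' implementation: the class `GazeMemory`, the function `build_prompt`, the first-in-first-out eviction policy, and the prompt wording are all assumptions. Only the 250-word budget comes from the TL;DR above.

```python
from collections import deque


class GazeMemory:
    """Hypothetical rolling buffer of the most recent words the user fixated on."""

    def __init__(self, max_words: int = 250):
        # A deque with maxlen drops the oldest words automatically
        # once the word budget is exceeded (assumed FIFO policy).
        self.words = deque(maxlen=max_words)

    def observe(self, fixated_text: str) -> None:
        # fixated_text stands in for OCR output from the region under the gaze point.
        self.words.extend(fixated_text.split())

    def context(self) -> str:
        # Return the remembered words in the order they were seen.
        return " ".join(self.words)


def build_prompt(memory: GazeMemory, spoken_query: str) -> str:
    # Combine implicit context (gaze memory) with the explicit spoken query.
    # The template wording is illustrative, not taken from the paper.
    return (
        "Context the user recently looked at:\n"
        f"{memory.context()}\n\n"
        f"User question: {spoken_query}"
    )
```

In such a design, the user could simply ask "What does this mean?" and the buffer would supply the referent, which is the kind of implicit grounding the abstract describes. A bounded buffer also reflects the TL;DR's observation that supplying too much context can degrade accuracy.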
