EmBARDiment: an Embodied AI Agent for Productivity in XR

Riccardo Bovo; Steven Abreu; Karan Ahuja; Eric J Gonzalez; Li-Te Cheng; Mar Gonzalez-Franco

TL;DR

EmBARDiment tackles the inefficiency of explicit prompts in XR AI agents by introducing a gaze-driven, memory-augmented attention framework that implicitly derives user context from eye-gaze and episodic memory. The system embeds an embodied AI agent in a multi-window XR environment, combining speech, gaze, OCR-based text extraction, and a 250-word contextual memory to ground responses generated by a large language model. A within-subject user study compares Baseline, Full Context, and Eye-Tracking conditions, showing reduced question reformulations and higher satisfaction/helpfulness for Eye-Tracking, while Full Context can degrade accuracy due to overabundant context. The work provides concrete design considerations for multimodal XR agents, highlighting the benefits of implicit attention, contextual memory, and context-following embodiment in improving productivity and natural interaction with AI in XR.
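The 250-word contextual memory mentioned above can be illustrated with a minimal sketch. It assumes that gaze-selected text arrives as plain strings (for example, after OCR) and that the oldest words are dropped first once the buffer is full; the eviction policy, the class name `GazeContextMemory`, the function `build_prompt`, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from collections import deque


class GazeContextMemory:
    """Rolling word buffer of text the user has fixated on (hypothetical helper)."""

    def __init__(self, max_words: int = 250):
        # A deque with maxlen discards the oldest words once the limit is reached.
        # First-in-first-out eviction is an assumption, not the paper's stated policy.
        self._words = deque(maxlen=max_words)

    def add_fixated_text(self, text: str) -> None:
        # Split on whitespace so the limit is counted in words.
        self._words.extend(text.split())

    def context(self) -> str:
        return " ".join(self._words)


def build_prompt(memory: GazeContextMemory, spoken_query: str) -> str:
    """Bundle the implicit gaze context with the explicit spoken query."""
    context = memory.context()
    if not context:
        # No gaze context yet: fall back to the query alone.
        return spoken_query
    return (
        f"Context the user recently read:\n{context}\n\n"
        f"User question: {spoken_query}"
    )
```

In this sketch the string returned by `build_prompt` would be what gets sent to the language model, so the user can ask a short question such as "what does this mean?" without restating what they were reading.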

Abstract

XR devices running chat-bots powered by Large Language Models (LLMs) have the potential to become always-on agents that enable much better productivity scenarios. Current screen-based chat-bots do not take advantage of the full suite of natural inputs available in XR, including inward-facing sensor data; instead, they over-rely on explicit voice or text prompts, sometimes paired with multi-modal data dropped as part of the query. We propose a solution that leverages an attention framework that derives context implicitly from user actions, eye-gaze, and contextual memory within the XR environment. Our work minimizes the need for engineered explicit prompts, fostering grounded and intuitive interactions that glean user insights for the chat-bot.

Paper Structure (39 sections, 9 figures, 2 tables)

Introduction
Related work
Context-Aware Assistants for Productivity in XR
Gaze attention driven Multimodal XR Interactions
From Chat-bots to Embodied XR Agents
EmBARDiment
Embodied AI Agent
Multimodal Interaction
Gaze-Driven Contextual Memory
Experiment
Participants
Design
Q&A Reading Task
Multi Window Layout
Procedure
...and 24 more sections

Figures (9)

Figure 1: Schema of EmBARDiment. The attention framework leverages implicit eye-gaze to select contextual information and bundles it with explicit verbal inputs. This elicits grounded communication between the User and the AI Agent.
Figure 2: Experiment Conditions. (A) Baseline: no contextual information is selected. (B) Full-Context: all the contextual information is selected. (C) Eye-gaze: information is selected based on eye-gaze fixations.
Figure 3: Experiment layout, as seen by the participants during the experiment. The 3 texts are fixed for all participants to maintain stimuli consistency. The layout of windows is spawned 1 meter away from the user's head, spans 120° (60° on the left and 60° on the right), and has a resolution of 700x1200 px. A 3D model of the layout with reference textures is available at the following GitHub repository: https://emBARDiment.github.io.
Figure 4: Box plot comparing participants' attempts. Each participant answered two questions, with up to five attempts per question. Significant differences highlighted: * $p < .05$, ** $p < .01$, *** $p < .001$.
Figure 5: Sankey diagram depicting the success rate at each subsequent attempt (i.e., up to five attempts per question).
...and 4 more figures
