EmBARDiment: an Embodied AI Agent for Productivity in XR

Riccardo Bovo, Steven Abreu, Karan Ahuja, Eric J Gonzalez, Li-Te Cheng, Mar Gonzalez-Franco

TL;DR

EmBARDiment tackles the inefficiency of explicit prompts in XR AI agents by introducing a gaze-driven, memory-augmented attention framework that implicitly derives user context from eye-gaze and episodic memory. The system embeds an embodied AI agent in a multi-window XR environment, combining speech, gaze, OCR-based text extraction, and a 250-word contextual memory to ground responses generated by a large language model. A within-subject user study compares Baseline, Full Context, and Eye-Tracking conditions, showing reduced question reformulations and higher satisfaction/helpfulness for Eye-Tracking, while Full Context can degrade accuracy due to overabundant context. The work provides concrete design considerations for multimodal XR agents, highlighting the benefits of implicit attention, contextual memory, and context-following embodiment in improving productivity and natural interaction with AI in XR.
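To make the idea of a bounded contextual memory concrete, here is a minimal sketch of how text the user has fixated on could be kept in a rolling 250-word buffer and bundled with a spoken query. The class name `GazeContextMemory`, the function `build_prompt`, the prompt wording, and the oldest-words-first eviction policy are all illustrative assumptions; the summary above states only that the memory holds 250 words, not how it is implemented.

```python
from collections import deque


class GazeContextMemory:
    """Hypothetical rolling buffer of words the user has looked at.

    Assumption: when the buffer is full, the oldest words are dropped first.
    """

    def __init__(self, max_words: int = 250):
        # A deque with maxlen silently discards items from the left when full.
        self._words = deque(maxlen=max_words)

    def add_fixated_text(self, text: str) -> None:
        """Append the words of a text span (e.g. OCR output) that was fixated on."""
        self._words.extend(text.split())

    def as_context(self) -> str:
        """Return the remembered words as a single string, oldest first."""
        return " ".join(self._words)


def build_prompt(memory: GazeContextMemory, spoken_query: str) -> str:
    """Bundle implicit gaze context with the explicit spoken query.

    The prompt layout is an assumption made for illustration only.
    """
    context = memory.as_context()
    if not context:
        return f"User question: {spoken_query}"
    return (
        "Context the user recently looked at:\n"
        f"{context}\n\n"
        f"User question: {spoken_query}"
    )


if __name__ == "__main__":
    memory = GazeContextMemory(max_words=250)
    memory.add_fixated_text("Quarterly revenue grew in the third quarter.")
    print(build_prompt(memory, "What does this paragraph say about revenue?"))
```

With an empty memory the sketch falls back to the bare question, which corresponds loosely to a no-context baseline; feeding every visible window into the buffer regardless of gaze would approximate a full-context variant.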

Abstract

XR devices running chat-bots powered by Large Language Models (LLMs) have the potential to become always-on agents that enable much better productivity scenarios. Current screen-based chat-bots do not take advantage of the full suite of natural inputs available in XR, including inward-facing sensor data; instead, they over-rely on explicit voice or text prompts, sometimes paired with multi-modal data dropped as part of the query. We propose a solution that leverages an attention framework that derives context implicitly from user actions, eye-gaze, and contextual memory within the XR environment. Our work minimizes the need for engineered explicit prompts, fostering grounded and intuitive interactions that glean user insights for the chat-bot.

Paper Structure (39 sections, 9 figures, 2 tables)

This paper contains 39 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Schema of EmBARDiment. The attention framework leverages implicit eye-gaze to select contextual information and bundles it with explicit verbal inputs. This elicits grounded communication between the User and the AI Agent.
  • Figure 2: Experiment Conditions. (A) Baseline: no contextual information is selected. (B) Full-Context: all the contextual information is selected. (C) Eye-gaze: information is selected based on eye-gaze fixations.
  • Figure 3: Experiment layout, as seen by the participants during the experiment. The 3 texts are fixed for all participants to maintain stimuli consistency. The window layout is spawned 1 meter away from the user's head, spans 120° (60° on the left and 60° on the right), and has a resolution of 700x1200 px. A 3D model of the layout with reference textures is available at the following GitHub repository: https://emBARDiment.github.io.
  • Figure 4: Box plot comparing participants' attempts. Each participant answered two questions, with up to five attempts per question. Significant differences are highlighted: * $p < .05$, ** $p < .01$, *** $p < .001$.
  • Figure 5: Sankey diagram depicting the success rate at each subsequent attempt (i.e., up to five attempts per question).
  • ...and 4 more figures