Design and Implementation of the Transparent, Interpretable, and Multimodal (TIM) AR Personal Assistant

Erin McGowan, Joao Rulff, Sonia Castelo, Guande Wu, Shaoyu Chen, Roque Lopez, Bea Steers, Iran R. Roman, Fabio F. Dias, Jing Qian, Parikshit Solunke, Michael Middleton, Ryan McKendrick, Claudio T. Silva

TL;DR

TIM is a transparent, multimodal AR personal assistant that integrates perception, memory, reasoning, and an adaptive UI to deliver just-in-time task guidance while preserving comprehensive data provenance for post-hoc analysis. The system combines egocentric perception, 3D memory, and two reasoning approaches (a dependency-graph model and a random-forest model using EgoHOS features) to produce interpretable instructions anchored in the 3D environment. Real-time analytics and visualization tools support debugging and retrospective evaluation of model behavior and human performance across temporal, spatial, and physiological data streams. Domain-calibrated deployments in tactical field care and copilot monitoring demonstrate TIM’s adaptability and show how customized components can address task-specific challenges. Limitations include a focus on physical tasks, sensitivity to lighting, and limited support for multi-performer collaboration; future work targets broader generalization, workload modeling, and expansion to layperson use cases.

Abstract

The concept of an AI assistant for task guidance is rapidly shifting from a science fiction staple to an impending reality. Such a system is inherently complex: it requires models for perceptual grounding, attention, and reasoning, an intuitive interface that adapts to the performer's needs, and the orchestration of data streams from many sensors. Moreover, all data acquired by the system must be readily available for post-hoc analysis so that developers can understand performer behavior and quickly detect failures. We introduce TIM, the first end-to-end AI-enabled task guidance system in augmented reality that is capable of detecting both the user and the scene as well as providing adaptable, just-in-time feedback. We discuss the system challenges and propose design solutions. We also demonstrate how TIM adapts to domain applications with varying needs, highlighting how the system components can be customized for each scenario.

Paper Structure

This paper contains 23 sections and 6 figures.

Figures (6)

  • Figure 1: The main components of the TIM ecosystem. Tools to facilitate data collection for experiment trials and spatiotemporal analysis (A) are crucial to ensure high data quality. Tailored visual widgets to provide feedback to performers and novel interaction mechanisms (B) are needed to ensure smooth guidance throughout different tasks. State-of-the-art machine learning algorithms (C) are needed to perceive the environment and reason about the task's current state.
  • Figure 2: An overview of the TIM architecture.
  • Figure 3: The Model Output View analysis of a cooking session. To the left, the model outputs are listed vertically. To the right, the confidence matrix displays the temporal distribution of ML model output confidences across the session.
  • Figure 4: Timeline View: Performance Overview for Participant N. The Timeline Summary Matrix views depict performance across three consecutive trials under identical task conditions. Key observations include consistent task execution, decreased errors (particularly in Procedure E), increased errors in Procedure F linked to the preflight-to-flight phase transition, and correlations between errors and mental states. Workload summaries show improved mental states across trials, with the final trial predominantly reflecting optimal states. At the bottom, sample frames from Trials 1, 2, and 3 are displayed.
  • Figure 5: The world point cloud (left) with annotations for spatial data streams and a panorama of selected frames (right) with annotations for object detection model outputs.
  • ...and 1 more figures