OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs

Jiahao Nick Li, Yan Xu, Tovi Grossman, Stephanie Santosa, Michelle Li

TL;DR

OmniActions tackles friction in pervasive AR by predicting users' follow-up actions from real-world multimodal inputs using an LLM-based pipeline grounded in a data-derived design space. The pipeline is evaluated on data from a five-day diary study with 39 participants, comparing three LLM techniques (intent classification, in-context learning, and finetuning); in-context learning with Chain-of-Thought (CoT) prompts on GPT-4 achieves up to 94.3% top-3 accuracy on general actions. A mobile proof-of-concept prototype demonstrates real-time multimodal reasoning and action prediction, and preliminary user feedback highlights both its potential and its usability challenges. The work advances proactive multimodal interaction by grounding predictions in an empirically derived action design space and by showing how CoT reasoning improves explainability and performance, and it outlines avenues for online adaptation and personalized AR experiences.
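
To make the "top-3 accuracy" figure concrete, the following is a minimal sketch of how such a metric is commonly computed. It assumes one ground-truth action per sample and a ranked list of predicted actions per sample; the function name `top_k_accuracy` and the action labels are hypothetical illustrations, not the paper's design space or evaluation code.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    true_actions: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of samples whose true action appears in the first k predictions.

    Assumption: each sample has exactly one ground-truth action.
    """
    if len(ranked_predictions) != len(true_actions):
        raise ValueError("predictions and ground truth must have equal length")
    if not true_actions:
        return 0.0
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_predictions, true_actions)
    )
    return hits / len(true_actions)


# Hypothetical action labels, used only for illustration.
predictions = [
    ["search", "save", "share"],
    ["share", "remind", "save"],
    ["save", "search", "remind"],
    ["remind", "share", "search"],
]
truths = ["save", "search", "save", "remind"]

top3 = top_k_accuracy(predictions, truths, k=3)
top1 = top_k_accuracy(predictions, truths, k=1)
```

Under this definition, a prediction counts as correct whenever the user's intended action is among the three suggestions shown, which is why top-3 accuracy is never lower than top-1 accuracy on the same data.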

Abstract

The progression to "Pervasive Augmented Reality" envisions easy access to multimodal information continuously. However, in many everyday scenarios, users are occupied physically, cognitively or socially. This may increase the friction to act upon the multimodal information that users encounter in the world. To reduce such friction, future interactive interfaces should intelligently provide quick access to digital actions based on users' context. To explore the range of possible digital actions, we conducted a diary study that required participants to capture and share the media that they intended to perform actions on (e.g., images or audio), along with their desired actions and other contextual information. Using this data, we generated a holistic design space of digital follow-up actions that could be performed in response to different types of multimodal sensory inputs. We then designed OmniActions, a pipeline powered by large language models (LLMs) that processes multimodal sensory inputs and predicts follow-up actions on the target information grounded in the derived design space. Using the empirical data collected in the diary study, we performed quantitative evaluations on three variations of LLM techniques (intent classification, in-context learning and finetuning) and identified the most effective technique for our task. Additionally, as an instantiation of the pipeline, we developed an interactive prototype and reported preliminary user feedback about how people perceive and react to the action predictions and their errors.

Paper Structure (84 sections, 1 equation, 14 figures, 6 tables)

Figures (14)

  • Figure 1: The development process for OmniActions. (a) An internal workshop was conducted to (b) generate informative examples of situations in which users may take action using multimodal information. (c) The examples were used to inform and inspire the participants during a diary study that (d) collected data when participants wished to take action using multimodal data. (e) The follow-up actions submitted by participants were then analyzed and categorized into a design space. (f) The collected data included contextual information that was used to train a prediction model that was (g) integrated within OmniActions to predict multiple follow-up actions given multimodal information.
  • Figure 1: Confusion matrices for dominant-only prediction and intent classification.
  • Figure 2: Screenshots from the formative workshop where participants shared data in Session 1, reviewed other participants' data in Session 2, and grouped similar actions in Session 3.
  • Figure 2: Confusion matrices for finetuning and in-context learning.
  • Figure 3: Frequencies of the 13 follow-up actions generated during the workshop (n = 170) that were grouped into 4 categories.
  • ...and 9 more figures