Identifying User Goals from UI Trajectories

Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan

TL;DR

This work introduces a novel task of identifying user goals from UI trajectories, producing natural language goal descriptions from multimodal UI traces. It formalizes evaluation via task fulfillment and satisfaction, and validates the approach using inverted UI automation datasets from web (Mind2Web) and Android (AitW/AitZ) with human and multimodal model benchmarks. Experiments reveal a substantial gap between expert humans and current models (Gemini and GPT-4o), underscoring the complexity of inferring intents from UI interactions. The study lays a foundation for future work in goal-aware agents and personalization, with potential extensions to more GUI platforms and ethical considerations for user privacy.

Abstract

Identifying underlying user goals and intents has been recognized as valuable in various personalization-oriented settings, such as personalized agents, improved search responses, advertising, and user analytics. In this paper, we propose a new task, goal identification from observed UI trajectories, which aims to infer the user's detailed intentions when performing a task within UI environments. To support this task, we also introduce a novel evaluation methodology designed to assess whether two intent descriptions can be considered paraphrases within a specific UI environment. Furthermore, we demonstrate how this task can leverage datasets designed for the inverse problem of UI automation, utilizing Android and web datasets for our experiments. To benchmark this task, we compare the performance of humans and state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro, using our proposed metric. The results reveal that both Gemini and GPT underperform relative to humans, underscoring the difficulty of the proposed task and the significant room for improvement. This work highlights the importance of goal identification within UI trajectories, providing a foundation for further exploration and advancement in this area.

Paper Structure

This paper contains 31 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: An example of a user performing a flight booking task. The agent first observes the UI interactions, comprehends the task's essence, and then offers help with related tasks, like booking a hotel and blocking calendar dates. We focus on the first part, comprehending the task by observing the UI interactions.
  • Figure 2: An instance from Mind2Web, representing a full trajectory accomplishing the task description above.
  • Figure 3: An instance from AitW, representing a full trajectory accomplishing the task "Set an alarm". The blue plus sign indicates the area on the screen where the tap occurred.
  • Figure 4: An illustration in which the user chose a specific car primarily for its 12-inch feature, but since it was also the cheapest, annotators incorrectly assumed cost was the deciding factor.
  • Figure 5: Comparison of "Match" proportions between the Gemini 1.5 Pro and GPT-4-Turbo models across the different domains.
  • ...and 5 more figures