From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating Mobile UI Operation Impacts

Zhuohao Jerry Zhang, Eldon Schoop, Jeffrey Nichols, Anuj Mahajan, Amanda Swearngin

TL;DR

This work tackles the safety gap in autonomous AI agents operating mobile UIs by developing a comprehensive taxonomy of UI action impacts and validating it through a data synthesis study that yields realistic action traces with potential real-world consequences. It then systematically evaluates several large language models under multiple prompting strategies (including KAP and CoT) to assess their ability to reason about and classify UI action impacts, revealing substantial gaps and overestimation tendencies. The contributions include a 10-by-35-category taxonomy, a dataset of 250 synthesized traces plus annotated external data, and an empirical evaluation showing that even the best-performing models struggle to reliably understand nuanced impact categories. The findings underscore the need for safer, policy-driven AI agent design and further refinement of models, data, and UI designs to manage the consequences of autonomous UI actions in practical settings.

Abstract

With advances in generative AI, there is increasing work towards creating autonomous agents that can manage daily tasks by operating user interfaces (UIs). While prior research has studied the mechanics of how AI agents might navigate UIs and understand UI structure, the effects of agents and their autonomous actions, particularly those that may be risky or irreversible, remain under-explored. In this work, we investigate the real-world impacts and consequences of mobile UI actions taken by AI agents. We began by developing a taxonomy of the impacts of mobile UI actions through a series of workshops with domain experts. Following this, we conducted a data synthesis study to gather realistic mobile UI screen traces and action data that users perceive as impactful. We then used our impact categories to annotate our collected data and data repurposed from existing mobile UI navigation datasets. Our quantitative evaluations of different large language models (LLMs) and variants demonstrate how well different LLMs can understand the impacts of mobile UI actions that might be taken by an agent. We show that our taxonomy enhances the reasoning capabilities of these LLMs for understanding the impacts of mobile UI actions, but our findings also reveal significant gaps in their ability to reliably classify more nuanced or complex categories of impact.

Paper Structure

This paper contains 50 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The detailed categories, definitions, and examples of the taxonomy.
  • Figure 2: The web interface for participants to generate UI action traces with impacts, including the mobile screen on the left, and login and recording functions on the right.
  • Figure 3: An annotated example of a monetary transaction. Each category is from the taxonomy, with square brackets indicating that multiple selections are possible for that category.
  • Figure 4: The distribution of the perceived impact level in our synthesized data and two existing datasets.
  • Figure 5: The distribution of task domains in our synthesized data and two existing datasets.
  • ...and 2 more figures