AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild

Jiazheng Sun, Mingxuan Li, Yingying Zhang, Jiayang Niu, Yachen Wu, Ruihan Jin, Shuyu Lei, Pengrongrui Tan, Zongyu Zhang, Ruoyi Wang, Jiachen Yang, Boyu Yang, Jiacheng Liu, Xin Peng

TL;DR

AmbiBench tackles the gap between real-world user intent and mobile GUI agent execution by introducing a four-level Instruction Clarity taxonomy and an automated, interactive evaluation framework. It formalizes task intent with Ground Truth $\mathcal{U}_{gt}$ and Observed Instruction $\mathcal{I}_{obs}$, and measures alignment through the Cognitive Gap $\mathcal{G}$ and refined metrics across Outcome, Execution, and Interaction dimensions using the MUSE framework. The dataset comprises 240 tasks across 25 apps with rigorous legitimacy assurances and a four-phase construction pipeline to stress-test planning, execution, and interaction. Empirical results show that interactive agents significantly outperform non-interactive ones under ambiguous conditions, underscore the diagnostic value of fine-grained process metrics, and demonstrate a strong alignment between MUSE metrics and human judgments, establishing AmbiBench as a new standard for evaluating truly intent-aligned mobile GUI agents in dynamic environments.

Abstract

Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users frequently fail to articulate precise directives containing full task details at the outset, and their expressions are typically ambiguous. Consequently, agents are required to converge on the user's true intent via active clarification and interaction during execution. However, existing benchmarks predominantly operate under the idealized assumption that user-issued instructions are complete and unequivocal. This paradigm focuses exclusively on assessing single-turn execution while overlooking the alignment capability of the agent. To address this limitation, we introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap theory, we propose a taxonomy of four clarity levels: Detailed, Standard, Incomplete, and Ambiguous. We construct a rigorous dataset of 240 ecologically valid tasks across 25 applications, subject to strict review protocols. Furthermore, targeting evaluation in dynamic environments, we develop MUSE (Mobile User Satisfaction Evaluator), an automated framework utilizing an MLLM-as-a-judge multi-agent architecture. MUSE performs fine-grained auditing across three dimensions: Outcome Effectiveness, Execution Quality, and Interaction Quality. Empirical results on AmbiBench reveal the performance boundaries of SoTA agents across different clarity levels, quantify the gains derived from active interaction, and validate the strong correlation between MUSE and human judgment. This work redefines evaluation standards, laying the foundation for next-generation agents capable of truly understanding user intent.

Paper Structure (35 sections, 11 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overview of the motivation and architecture of AmbiBench. Non-interactive agents suffer from intent deviation due to probabilistic guessing (Top-Left), whereas interactive agents achieve intent alignment through active inquiry (Bottom-Left). To benchmark interactive agents, AmbiBench categorizes instructions into four clarity levels. A User Simulator, retaining Ground Truth information, engages in dynamic interaction with the agent, while the Sandbox synchronously captures execution traces. Subsequently, a multi-agent evaluation framework performs automated analysis, profiling agent capabilities at a fine granularity across three dimensions: Outcome Effectiveness, Execution Quality, and Interaction Quality.
  • Figure 2: Classification matrix for task legitimacy and effectiveness assurance. This matrix establishes task admission and transformation logic across two dimensions: Requirement Nature (Functional/Value-Laden) and Intent Origin (Prior/Posterior). Case A (Retain): Tasks characterized by Prior Intent align with human cognitive logic and possess a determinate set of requirements; thus, AmbiBench retains these in their entirety. Case B (Convert): For Posterior Intent tasks, users cannot foresee outcomes ab initio, rendering it impossible to articulate plausible Detailed or Standard instructions. However, they can naturally issue Incomplete or Ambiguous instructions. For such scenarios, AmbiBench employs a Preset Posterior transformation strategy, which populates user requirements with predefined selections to reconstruct the task as a determinate, pseudo-prior objective. Case C (Exclude): In the absence of actual content consumption, users would implicitly refuse to authorize an agent to articulate an Affective Stance on their behalf. Consequently, posterior value-laden tasks contravene the ethical logic governing intelligent agents, and AmbiBench rigorously excludes them.
  • Figure 3: The derivation process of task instructions across 4 clarity levels from requirements.
  • Figure 4: Domain distribution and difficulty quantification statistics of the AmbiBench dataset.
  • Figure 5: MUSE data acquisition pipeline. The system decouples the Subject Agent from the evaluation environment via a standardized interface: the Physical Sandbox executes instructions and captures execution traces containing screenshots and action logs, while the User Simulator answers the Subject Agent based on the requirement list and records the interaction history, collectively completing the collection of evaluation data.
  • ...and 2 more figures
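
The admission logic described in the Figure 2 caption can be read as a small decision function over the two dimensions. The following is a minimal sketch of that reading, not the authors' implementation: the names `RequirementNature`, `IntentOrigin`, `Admission`, and `admit_task` are hypothetical, and mapping Case B to posterior functional tasks is an assumption inferred from the caption, which assigns posterior value-laden tasks to Case C.

```python
from enum import Enum


class RequirementNature(Enum):
    FUNCTIONAL = "functional"
    VALUE_LADEN = "value-laden"


class IntentOrigin(Enum):
    PRIOR = "prior"
    POSTERIOR = "posterior"


class Admission(Enum):
    RETAIN = "Case A: retain in full"
    CONVERT = "Case B: convert via Preset Posterior"
    EXCLUDE = "Case C: exclude"


def admit_task(nature: RequirementNature, origin: IntentOrigin) -> Admission:
    """Hypothetical helper mirroring the Figure 2 matrix as described in its caption."""
    # Case A: prior-intent tasks have a determinate requirement set and are retained.
    if origin is IntentOrigin.PRIOR:
        return Admission.RETAIN
    # Case C: posterior value-laden tasks would require the agent to voice an
    # affective stance for the user, so they are excluded.
    if nature is RequirementNature.VALUE_LADEN:
        return Admission.EXCLUDE
    # Case B (assumed to cover the remaining posterior tasks): requirements are
    # filled with predefined selections to form a pseudo-prior objective.
    return Admission.CONVERT


if __name__ == "__main__":
    for nature in RequirementNature:
        for origin in IntentOrigin:
            print(nature.value, origin.value, "->", admit_task(nature, origin).value)
```

Checking the intent origin first keeps the function aligned with the caption's wording, where Case A is defined by prior intent alone and the requirement nature only matters for posterior tasks.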