
GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

Yuwen Zhai, Runze Li, Liang Wang, Nian Shi, Liwu Xu, Wei Zhang, Ran Lin, Bo Xu, Benlei Cui

Abstract

Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open-ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action-observation sequence, a strategy that proves unreliable on long-horizon tasks and yields binary verdicts offering no insight into where or why an agent fails. This opacity limits the utility of evaluation as a diagnostic tool for agent development. We introduce GUIDE (GUI Understanding and Interpretable Diagnostic Evaluation), a framework that decomposes trajectory assessment into three sequential stages mirroring the compositional structure of GUI tasks. Trajectory Segmentation partitions the full trace into semantically coherent subtask units. Subtask Diagnosis evaluates each unit in context, assigning a completion verdict and generating a structured error analysis with corrective recommendations. Overall Summary aggregates per-subtask diagnoses into a task-level judgment. By operating on bounded subtask segments rather than full trajectories, GUIDE mitigates the context overload that degrades existing evaluators as task complexity grows. We validate GUIDE on three benchmarks: an industrial e-commerce dataset of 932 trajectories, AGENTREWARDBENCH spanning five web agent tasks with 1302 trajectories, and AndroidBench for mobile device control. Across all settings, GUIDE substantially outperforms existing evaluators, achieving up to 5.35 percentage points higher accuracy than the strongest baseline, while producing structured diagnostic reports that directly inform agent improvement.
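The three-stage decomposition described above can be sketched as a minimal pipeline. This is an illustrative stand-in, not the paper's implementation: the function names (`segment_trajectory`, `diagnose_subtask`, `summarize`) and the `Step`/`Diagnosis` types are hypothetical, and the paper's actual modules presumably invoke a multimodal LLM judge where the placeholder logic appears below.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One action-observation pair from the agent trajectory."""
    action: str
    observation: str  # e.g., a screenshot reference or accessibility-tree dump

@dataclass
class Diagnosis:
    """Structured per-subtask diagnosis: verdict, error analysis, recommendation."""
    subtask: str
    completed: bool
    error_analysis: str = ""
    recommendation: str = ""

def segment_trajectory(task: str, trajectory: list[Step]) -> list[tuple[str, list[Step]]]:
    """Module 1 (Trajectory Segmentation): partition the trace into semantically
    coherent subtask units. Placeholder: a trivial single-segment split; the real
    module would use a model to detect subtask boundaries."""
    return [(task, trajectory)]

def diagnose_subtask(subtask: str, steps: list[Step]) -> Diagnosis:
    """Module 2 (Subtask Diagnosis): evaluate one bounded segment in context.
    Placeholder verdict; the real module would produce an LLM-generated
    error analysis and corrective recommendation."""
    return Diagnosis(subtask=subtask, completed=bool(steps))

def summarize(diagnoses: list[Diagnosis]) -> bool:
    """Module 3 (Overall Summary): aggregate per-subtask verdicts into a
    task-level judgment."""
    return all(d.completed for d in diagnoses)

def guide_evaluate(task: str, trajectory: list[Step]) -> bool:
    """End-to-end evaluation: segment, diagnose each unit, then aggregate."""
    segments = segment_trajectory(task, trajectory)
    diagnoses = [diagnose_subtask(name, steps) for name, steps in segments]
    return summarize(diagnoses)
```

The key design point the sketch captures is that each `diagnose_subtask` call sees only one bounded segment, so per-call context stays small no matter how long the full trajectory grows.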

Paper Structure

This paper contains 37 sections, 4 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Comparison of GUI agent evaluation paradigms. From left to right: (1) text-only LLM evaluators that ignore visual state; (2) lightweight multimodal methods (AgentTrek) that append only the final screenshot; (3) WebJudge, which retrieves key frames from the trajectory and evaluates them holistically in a single context; and (4) GUIDE (ours), which decomposes the trajectory into subtask segments and applies structured diagnosis to each, combining full visual coverage with bounded per-call context.
  • Figure 2: Overview of the GUIDE framework. Given a task description and a full agent trajectory, Module 1 (Trajectory Segmentation) partitions the action-observation trace into semantically coherent subtask units. Module 2 (Subtask Diagnosis) evaluates each unit independently, producing a structured diagnosis with a completion verdict, error analysis, and corrective recommendations. Module 3 (Overall Summary) aggregates the per-subtask diagnoses into a final task-level judgment.
  • Figure 3: Statistics of the industrial e-commerce dataset. (a) Trajectory length distribution across six groups. (b) Overall success/failure ratio (37.0% vs. 63.0%). (c) Per-group success rate; success decreases monotonically from 39.4% (<10 steps) to 18.2% (50–80 steps), confirming that longer trajectories are inherently more challenging.