GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks

Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim

Abstract

Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention: users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software applications. GUIDE defines three tasks, (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction, which test a model's ability to recognize behavior states, reason about goals, and decide when and how to help. Evaluations of eight state-of-the-art multimodal models reveal that all models struggle, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction, respectively. However, providing user context significantly improves performance, raising help prediction accuracy by up to 50.2 percentage points, highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at https://guide-bench.github.io.
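
To make the evaluation setup concrete, the sketch below shows one way the three tasks could be scored against ground-truth annotations. The record layout, the predict() interface, and exact-match scoring are illustrative assumptions, not the benchmark's actual schema; the abstract reports accuracy for behavior state and help prediction, while open-ended intent predictions would need a softer matching scheme (e.g., judged similarity).

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical record layout for one annotated segment of a screen recording.
    # Field names are illustrative assumptions, not GUIDE's actual schema.
    @dataclass
    class GuideSegment:
        video_path: str              # screen-recording clip (no think-aloud narration)
        behavior_state: str          # e.g., "Exploration and Decision-Making"
        intent: str                  # e.g., "Create a progress bar"
        needs_help: bool             # whether the user needs assistance here
        help_content: Optional[str]  # what kind of help, if any

    def evaluate(model, segments: list[GuideSegment]) -> dict[str, float]:
        """Score a model on behavior state and help prediction (accuracy).

        `model.predict` is an assumed interface returning an object with
        `.behavior_state` and `.needs_help`; intent prediction is open-ended
        and would require judged matching rather than exact string comparison.
        """
        correct = {"behavior_state": 0, "help": 0}
        for seg in segments:
            pred = model.predict(seg.video_path)
            correct["behavior_state"] += int(pred.behavior_state == seg.behavior_state)
            correct["help"] += int(pred.needs_help == seg.needs_help)
        n = len(segments)
        return {task: hits / n for task, hits in correct.items()}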

Paper Structure

This paper contains 63 sections, 5 equations, 24 figures, and 13 tables.

Figures (24)

  • Figure 1: An example from the GUIDE benchmark, which jointly models three tasks, Behavior State Detection, Intent Prediction, and Help Prediction, to interpret what the user is doing, what they aim to achieve, and whether and with what they may need assistance during open-ended software tasks.
  • Figure 2: Overview of the three core tasks in the GUIDE benchmark. (1) User Behavior State Detection identifies the user's current behavioral mode (e.g., Exploration and Decision-Making). (2) Intent Prediction infers what the user is trying to achieve (e.g., Create a progress bar). (3) Help Prediction determines whether the user needs assistance and, if so, what kind of help is relevant (e.g., Get a guide on how to use text effects). Together, these tasks enable a comprehensive understanding of user behavior and assistance needs in software GUI environments. We evaluate MLLMs on their ability to infer these solely from the visual input, without access to the demonstrator’s narration --- a setting that closely reflects real-world use.
  • Figure 3: Our proposed taxonomy of user behavior states in GUI-based software tasks, organized into four main phases: Planning, Execution, Problem-Solving, and Evaluation. Each phase captures distinct patterns of user cognition and interaction, from initial goal formulation to iterative action, troubleshooting, and reflection.
  • Figure 4: Accuracy trends across the tasks in the online setting, where models are given progressively more of the video segment (25%, 50%, 75%, and 100%). Models improve consistently as they see more of each segment, with Gemini-2.5-Flash [Gemini2025] and Qwen3-VL-8B [QwenTeam2025Qwen3] showing larger and more consistent gains across all four tasks than the smaller open-source models.
  • Figure B1: Distribution of screen recording video lengths in the dataset.
  • ...and 19 more figures