GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks

Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li

TL;DR

GUI Knowledge Bench addresses the knowledge gap in vision-language models (VLMs) for GUI task automation. It identifies three core dimensions of GUI knowledge (interface perception, interaction prediction, and instruction understanding) and proposes a large cross-platform benchmark. The benchmark combines data from multiple sources to produce 3,483 knowledge-centric questions over 292 apps across six platforms, enabling diagnostics before downstream training that go beyond traditional task-success metrics. Benchmark results reveal consistent gaps in state reasoning, action anticipation, and completion verification, while plan augmentation and knowledge integration improve performance. The work provides a practical framework to guide model selection and future research toward enriching VLMs with domain-specific GUI knowledge for more capable GUI agents.

Abstract

Large vision-language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize that this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine-tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface perception, knowledge about recognizing widgets and system states; (2) interaction prediction, knowledge about reasoning over the state transitions that actions cause; and (3) instruction understanding, knowledge about planning, verifying, and assessing task completion progress. We further introduce GUI Knowledge Bench, a benchmark with multiple-choice and yes/no questions across six platforms (Web, Android, macOS, Windows, Linux, iOS) and 292 applications. Our evaluation shows that current VLMs can identify widget functions but struggle with perceiving system states, predicting actions, and verifying task completion. Experiments on real-world GUI tasks further validate the close link between GUI knowledge and task success. By providing a structured framework for assessing GUI knowledge, our work supports the selection of VLMs with greater potential prior to downstream training and provides insights for building more capable GUI agents.
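To make the evaluation setup concrete, the following is a minimal sketch of how accuracy could be reported separately for each of the three knowledge dimensions on multiple-choice and yes/no questions. It is an illustration only, not the authors' evaluation code: the `Question` record, its field names, the `accuracy_by_dimension` helper, and the toy data are all assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one benchmark item (field names are assumed)."""
    qid: str
    dimension: str  # e.g. "interface perception"
    answer: str     # gold label: an option letter such as "B", or "yes"/"no"


def accuracy_by_dimension(questions, predictions):
    """Return {dimension: accuracy}; a missing prediction counts as wrong."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q.dimension] += 1
        predicted = predictions.get(q.qid, "").strip().lower()
        if predicted == q.answer.strip().lower():
            correct[q.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy data, invented purely for illustration.
questions = [
    Question("q1", "interface perception", "B"),
    Question("q2", "interface perception", "yes"),
    Question("q3", "interaction prediction", "A"),
    Question("q4", "instruction understanding", "no"),
]
predictions = {"q1": "b", "q2": "no", "q3": "A"}  # q4 left unanswered

scores = accuracy_by_dimension(questions, predictions)
```

Reporting one score per dimension, rather than a single overall accuracy, matches the abstract's framing that models can be strong on some kinds of GUI knowledge (widget functions) and weak on others (system states, action prediction, completion verification).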

Paper Structure

This paper contains 29 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: GUI Knowledge Bench: A benchmark evaluating VLMs on GUI knowledge across six platforms (Web, Android, macOS, Windows, Linux, iOS). It measures three types of knowledge: Interface Perception, which evaluates understanding of GUI components, layout, and system state; Interaction Prediction, which assesses the ability to anticipate user actions and foresee their effects on the interface; and Instruction Understanding, which tests whether a model can grasp task goals and plan correct execution steps.
  • Figure 2: Example questions for Interface Perception. red bounding box
  • Figure 3: Example questions for Interaction Prediction.
  • Figure 4: Example questions for Instruction Understanding.
  • Figure 5: Statistics of GUI Knowledge Bench, including question type distribution, images per question, image size distribution, and app category counts.
  • ...and 6 more figures