AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs

Hongxin Li, Jingfan Chen, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang

TL;DR

AutoGUI introduces a scalable, automatic annotation pipeline that uses LLMs to infer UI element functionality from interaction-induced UI changes, coupled with LLM-aided rejection and verification to ensure data quality. The authors build AutoGUI-704k, a large dataset of contextual functionality groundings, and demonstrate that fine-tuning open-source VLMs on this data yields strong UI grounding improvements and clear scaling effects. They validate data quality against human annotations and show the data’s utility for downstream GUI agent tasks, while also outlining limitations such as mobile app diversity and safety considerations. The work offers a practical path to overcoming data scarcity in UI grounding and highlights the potential of LLM-guided annotation for large-scale UI understanding.

Abstract

User interface understanding with vision-language models (VLMs) has received much attention due to its potential for enhancing software automation. However, existing datasets used to build UI-VLMs contain either large-scale context-free element annotations or small-scale contextualized functional descriptions of elements. In this work, we propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing UI state changes before and after simulated interactions. To improve annotation quality, we propose LLM-aided rejection and verification, eliminating invalid annotations without human labor. We construct a high-quality AutoGUI-704k dataset using the proposed pipeline, featuring diverse and detailed functionality annotations that are rarely provided by previous datasets. Human evaluation shows that we achieve annotation correctness comparable to a trained human annotator. Extensive experiments show that our dataset remarkably enhances VLMs' UI grounding capabilities and exhibits significant scaling effects. We also show the interesting potential use of our dataset in UI agent tasks. Please view our project at https://autogui-project.github.io/.
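
The annotation loop described in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the UI state is reduced to a list of text lines, a plain no-change check stands in for the paper's LLM-aided rejection, and `describe` and `verify` are hypothetical callables that would wrap LLM calls. The prompt wording is also invented for illustration.

```python
import difflib
from typing import Callable, Optional, Sequence


def ui_changes(before: Sequence[str], after: Sequence[str]) -> list[str]:
    """Return the UI text lines that were removed ('- ') or added ('+ ')."""
    return [
        line
        for line in difflib.ndiff(list(before), list(after))
        if line.startswith(("- ", "+ "))
    ]


def annotate_element(
    element: str,
    before: Sequence[str],
    after: Sequence[str],
    describe: Callable[[str], str],      # hypothetical LLM wrapper: prompt -> description
    verify: Callable[[str, str], bool],  # hypothetical LLM wrapper: accept or reject
) -> Optional[str]:
    """Annotate one element from the UI change its interaction caused."""
    changes = ui_changes(before, after)
    if not changes:
        # Stand-in for rejection: nothing observable changed, so there is
        # no evidence from which to infer functionality.
        return None
    # Prompt wording is an assumption made for this sketch.
    prompt = (
        f"A user interacted with: {element}\n"
        "The UI changed as follows:\n"
        + "\n".join(changes)
        + "\nDescribe the functionality of the element."
    )
    description = describe(prompt)
    # Verification: keep the annotation only if the checker accepts it.
    return description if verify(element, description) else None
```

Passing the two LLM steps in as callables keeps the sketch runnable without any model access; in practice each would be replaced by a call to whichever LLM is used for annotation and verification.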

Paper Structure

This paper contains 38 sections, 22 figures, and 14 tables.

Figures (22)

  • Figure 1: Our annotations are rich in functional semantics (bottom) compared with existing UI datasets.
  • Figure 2: The proposed pipeline for automatic UI functionality annotation. An LLM is utilized to predict element functionality based on the UI content changes observed during the interaction. LLM-aided rejection and verification are introduced to improve data quality. Finally, the high-quality functionality annotations will be converted to instruction-following data by applying task templates.
  • Figure 3: Element functionality annotations generated by the AutoGUI pipeline for both web and mobile domains.
  • Figure 4: Diversity of the AutoGUI dataset. Left: The word cloud illustrates the ratios of the verbs representing the main intents in the functionality annotations. Right: Comparing the distributions of the annotation token numbers for our AutoGUI training split, SeeClick Web training data (Cheng et al., 2024), and Widget Captioning (Li et al., 2020). The comparison demonstrates that our dataset covers significantly more diverse task lengths.
  • Figure 5: Scaling effect of the AutoGUI data. The three general-purpose VLMs are fine-tuned with three scales of AutoGUI data. Using more data consistently enhances the grounding accuracy of the three models. Note that the grounding accuracy (Y-axis) is averaged over all the element grounding benchmarks.
  • ...and 17 more figures
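
The last step in the Figure 2 caption, converting functionality annotations into instruction-following data by applying task templates, might look something like the sketch below. The template wording, the `Annotation` fields, and the normalized-coordinate answer format are all assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    functionality: str            # verified functionality description
    center: tuple[float, float]   # assumed: element center, normalized to [0, 1]


# Hypothetical template for a grounding task (functionality -> location).
GROUNDING_TEMPLATE = (
    "Locate the element according to its functionality: {functionality}"
)


def to_grounding_sample(ann: Annotation) -> dict[str, str]:
    """Turn one annotation into an instruction/answer pair."""
    x, y = ann.center
    return {
        "instruction": GROUNDING_TEMPLATE.format(functionality=ann.functionality),
        "answer": f"({x:.2f}, {y:.2f})",
    }


sample = to_grounding_sample(
    Annotation("This element adds the current item to the cart", (0.25, 0.5))
)
```

A reverse template (location to functionality) could be defined the same way to produce captioning-style samples from the same annotations.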