EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data

Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang

TL;DR

EDGE introduces a fully automated data-synthesis framework that leverages large-scale web pages to create rich, multi-granularity GUI training data for open-source LVLMs. By combining elementary element-grounding tasks with advanced interaction tasks and an icon-understanding component, EDGE substantially improves grounded GUI understanding and interaction across web, desktop, and mobile environments. Evaluations on GUI benchmarks (VisualWebBench, ScreenSpot) and agent benchmarks (MiniWob, AITW, Mind2Web) show strong gains over baselines, with ablations confirming the value of advanced tasks. The approach reduces manual annotation needs and demonstrates the potential of public web resources to advance GUI-focused LVLMs and downstream agent capabilities.

Abstract

Autonomous agents that operate on the graphical user interfaces (GUIs) of various applications hold immense practical value. Unlike large language model (LLM)-based methods, which rely on structured text and customized backends, approaches using large vision-language models (LVLMs) are more intuitive and adaptable: they visually perceive and directly interact with screens, making them indispensable in general scenarios that lack text metadata and tailored backends. Given the lack of high-quality training data for GUI-related tasks in existing work, this paper aims to enhance the GUI understanding and interaction capabilities of LVLMs through a data-driven approach. We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web. Evaluation results on various GUI and agent benchmarks demonstrate that a model trained on the EDGE-generated dataset exhibits superior webpage understanding capabilities, which transfer readily to previously unseen desktop and mobile environments. Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work. Our source code, dataset, and model are available at https://anonymous.4open.science/r/EDGE-1CDB.

Paper Structure

This paper contains 53 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Text-based agents take extracted textual metadata as input (e.g., HTML) and perform actions through specific backends (such as browser engines). Vision-based agents directly read the screen and execute actions like mouse clicks.
  • Figure 2: The diagram of the annotation stage, where the extraction of rich latent semantics and the information integration process are highlighted, with the (simplified) HTML of the relevant elements presented. Elements are marked with rectangular boxes for demonstration purposes only.
  • Figure 3: The synthesis of the elementary and advanced tasks.
  • Figure 4: Statistics of EDGE with respect to the number of images. The pie chart gives an overview of the distribution of samples across tasks and environments, while the right panel displays the number of samples from different webpage sources within the three task settings of the web environment.
  • Figure 5: Ablation results on VisualWebBench and ScreenSpot. Scores of ScreenSpot are averaged over three environments.
  • ...and 9 more figures