DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents

Yibin Xu, Liang Yang, Hao Chen, Hua Wang, Zhi Chen, Yaohua Tang

TL;DR

DeskVision addresses the shortage of desktop GUI data for training GUI agents by introducing AutoCaptioner, a scalable data-generation pipeline that yields richly captioned region data. Built with this pipeline, DeskVision and DeskVision-Eval provide a large-scale desktop dataset and test benchmark, which are used to train GUIExplorer, a GUI understanding model that achieves state-of-the-art grounding without requiring complex architectural designs. Across benchmarks, DeskVision improves LVLM performance on desktop tasks and generalizes to mobile and web domains, with ablations confirming substantial gains when DeskVision data is used for fine-tuning. The authors plan to open-source the pipeline and dataset, offering a practical path toward robust, scalable desktop GUI agents.

Abstract

The scarcity of graphical user interface (GUI) data has been a significant barrier to the development of GUI agents, especially for desktop/computer-use scenarios. To address this, we propose an automated GUI data generation pipeline, AutoCaptioner, which generates data with rich descriptions while minimizing human effort. Using AutoCaptioner, we created a novel large-scale desktop GUI dataset, DeskVision, along with the largest desktop test benchmark, DeskVision-Eval, which reflects daily usage and covers diverse systems and UI elements, each with rich descriptions. With DeskVision, we train a new GUI understanding model, GUIExplorer. Results show that GUIExplorer achieves state-of-the-art (SOTA) performance in understanding/grounding visual elements without the need for complex architectural designs. We further validate the effectiveness of the DeskVision dataset through ablation studies on various large visual language models (LVLMs). We believe that AutoCaptioner and DeskVision will significantly advance the development of GUI agents, and we will open-source them for the community.

Paper Structure

This paper contains 16 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Data Sourcing. (a) Examples of three data types. (b) Our data sourcing pipeline, which consists of three stages. Stage 1 uses limited labeled data to train the initial version of the classifier. Stage 2 employs an iterative training method: images are fed in batches (5k per batch in our setting) to update the data pool, and the updated pool is used to retrain the classifier. Stage 3 uses the final frozen classifier to clean large amounts of source data.
  • Figure 2: Data annotation pipeline. The screenshot is first sent to UI Detector to detect the interactive UI elements. Detected UI elements are then filtered, sampled and marked. After that, UI elements are sent to UI Captioner to generate final region captions.
  • Figure 3: Examples of data annotations on a single screenshot from different methods/datasets. Human_Annotation's result is manually labeled. Errors in the captions are marked in red, while detailed and accurate captions are marked in green.
  • Figure 4: Statistics of DeskVision. (a) The distribution of caption lengths. (b) The UI categories of the labelled elements. (c) The types of elements within each OS. (d) The heatmap of the spatial distribution of annotation elements in the normalized image.
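
The annotation flow described in the Figure 2 caption (detect, filter, sample, mark, caption) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the `Region` type, the function names, the thresholds (`min_score`, `min_area`, `max_iou`), and the specific filtering criteria are all assumptions, since the caption does not state how elements are filtered or sampled. The detector and captioner are passed in as plain callables so that any model could stand in for them.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Region:
    box: Box
    score: float      # detector confidence (hypothetical field)
    caption: str = ""


def area(box: Box) -> int:
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def filter_regions(regions: List[Region], min_score: float = 0.5,
                   min_area: int = 64, max_iou: float = 0.7) -> List[Region]:
    """Drop low-confidence, tiny, and near-duplicate boxes (assumed criteria)."""
    kept: List[Region] = []
    for r in sorted(regions, key=lambda r: r.score, reverse=True):
        if r.score < min_score or area(r.box) < min_area:
            continue
        if all(iou(r.box, k.box) <= max_iou for k in kept):
            kept.append(r)
    return kept


def sample_regions(regions: List[Region], k: int, seed: int = 0) -> List[Region]:
    """Keep at most k regions, chosen uniformly at random (assumed strategy)."""
    if len(regions) <= k:
        return list(regions)
    return random.Random(seed).sample(regions, k)


def annotate(screenshot: Any,
             detect: Callable[[Any], List[Region]],
             caption: Callable[[Any, Box, int], str],
             k: int = 5) -> List[Region]:
    """Detect -> filter -> sample -> mark -> caption."""
    chosen = sample_regions(filter_regions(detect(screenshot)), k)
    for mark_id, region in enumerate(chosen):
        # "Marking" is reduced here to a numeric id handed to the captioner.
        region.caption = caption(screenshot, region.box, mark_id)
    return chosen
```

To try it, `detect` can be a stub that returns a fixed list of `Region` objects and `caption` a function that formats the mark id; in a real setting both would wrap model calls.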