PixelWeb: The First Web GUI Dataset with Pixel-Wise Labels

Qi Yang, Weichen Bi, Haiyang Shen, Yaoqi Guo, Yun Ma

TL;DR

PixelWeb introduces a pixel-accurate web GUI dataset built from two modules, Channel Derivation and Layer Analysis, to produce BGRA element bitmaps with precise coordinates and visibility ordering. By combining chroma-key-based element extraction with DOM-aware layering, PixelWeb generates high-quality masks, contours, and BBoxes for over 100k pages, with manual verification confirming annotation fidelity. Empirical results show 3–7× improvements on $mAP_{95}$ for GUI element detection compared to prior datasets, and user studies corroborate annotation quality gains. The dataset's rich element-level metadata enables a wide range of GUI tasks, from precise element retrieval and generation to advanced layout and interaction modeling, with potential extension to mobile apps and non-web GUIs.

Abstract

Graphical User Interface (GUI) datasets are crucial for various downstream tasks. However, existing GUI datasets typically produce their annotations through automatic labeling, which commonly yields inaccurate BBox annotations for GUI elements, including missing, duplicate, or meaningless BBoxes. These issues can degrade the performance of models trained on these datasets, limiting their effectiveness in real-world applications. Additionally, BBoxes are the only visual annotation that existing GUI datasets provide, which restricts the development of visually related GUI downstream tasks. To address these issues, we introduce PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web pages. PixelWeb is constructed using a novel automatic annotation approach that integrates visual feature extraction and Document Object Model (DOM) structure analysis through two core modules: channel derivation and layer analysis. Channel derivation extracts a four-channel BGRA bitmap for each GUI element, which keeps localization accurate even when elements are occluded or overlap. Layer analysis uses the DOM to determine the visibility and stacking order of elements, providing precise BBox annotations. Additionally, PixelWeb includes comprehensive metadata such as element images, contours, and mask annotations. Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb annotations. Experimental results on GUI element detection tasks show that models trained on PixelWeb score 3 to 7 times higher on the mAP95 metric than models trained on existing datasets. We believe that PixelWeb has great potential for performance improvement in downstream tasks such as GUI generation and automated user interaction.
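
To illustrate why BBox precision matters for the mAP95 result, the sketch below computes intersection over union (IoU) for a ground-truth box and a slightly shifted prediction. It assumes that mAP95 denotes mean average precision at an IoU threshold of 0.95, which the text above does not spell out; the box coordinates and the `iou` helper are hypothetical and are not taken from the paper.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2), with x2 > x1 and y2 > y1."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical 100 x 40 px button and a prediction shifted by 2 px and 1 px.
gt_box = (100, 100, 200, 140)
pred_box = (102, 101, 202, 141)

score = iou(gt_box, pred_box)
print(f"IoU = {score:.3f}")
print("counts as a hit at IoU >= 0.50:", score >= 0.50)
print("counts as a hit at IoU >= 0.95:", score >= 0.95)
```

Under this assumption, a shift of only a couple of pixels already pushes a small element below the 0.95 threshold while still passing at 0.50, so noisy ground-truth boxes penalize a detector far more heavily at the strict threshold.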

Paper Structure

This paper contains 18 sections, 13 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Error cases of BBox labels in the WebUI dataset
  • Figure 2: Example of a web page annotated by our approach
  • Figure 3: Approach overview. (1) An open web page is taken as input. (2) The channel derivation module extracts the image of each element. (3) The layer analysis module analyzes the rendering layers of the elements to determine the image and coordinates of each GUI element. (4) From this information, the mask, contour, and BBox annotations are derived in sequence.
  • Figure 4: Channel derivation
  • Figure 5: Examples of extreme cases for the graphic color channel derivation formula
  • ...and 5 more figures
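
The last step in the Figure 3 caption, deriving mask, contour, and BBox annotations from an element image, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes each element is available as an H x W x 4 BGRA array with alpha in the last channel, that any non-zero alpha counts as visible, and that the element's top-left page coordinate is known. The function names, the threshold, and the toy element are all hypothetical.

```python
import numpy as np


def mask_from_bgra(bgra, alpha_threshold=0):
    """Boolean mask of pixels whose alpha exceeds the (assumed) threshold."""
    if bgra.ndim != 3 or bgra.shape[2] != 4:
        raise ValueError("expected an H x W x 4 BGRA array")
    return bgra[:, :, 3] > alpha_threshold


def contour_from_mask(mask):
    """Mask pixels that have at least one 4-neighbour outside the mask."""
    padded = np.pad(mask, 1, constant_values=False)
    interior = (
        padded[:-2, 1:-1]    # neighbour above
        & padded[2:, 1:-1]   # neighbour below
        & padded[1:-1, :-2]  # neighbour to the left
        & padded[1:-1, 2:]   # neighbour to the right
    )
    return mask & ~interior


def bbox_from_mask(mask, offset=(0, 0)):
    """Tight (x1, y1, x2, y2) box in page coordinates, x2/y2 exclusive."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # fully transparent element: nothing to annotate
    ox, oy = offset
    return (int(xs.min()) + ox, int(ys.min()) + oy,
            int(xs.max()) + 1 + ox, int(ys.max()) + 1 + oy)


# Toy element: a 6 x 8 bitmap with an opaque 4 x 5 rectangle inside it.
element = np.zeros((6, 8, 4), dtype=np.uint8)
element[1:5, 2:7, 3] = 255

mask = mask_from_bgra(element)
contour = contour_from_mask(mask)
bbox = bbox_from_mask(mask, offset=(10, 20))  # hypothetical page position

print("mask pixels:", int(mask.sum()))
print("contour pixels:", int(contour.sum()))
print("bbox:", bbox)
```

Deriving the BBox from the mask rather than from DOM geometry means the box hugs the pixels that are actually drawn, which is the property a pixel-wise dataset is meant to provide.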