PixelWeb: The First Web GUI Dataset with Pixel-Wise Labels
Qi Yang, Weichen Bi, Haiyang Shen, Yaoqi Guo, Yun Ma
TL;DR
PixelWeb is a pixel-accurate web GUI dataset built with two modules, Channel Derivation and Layer Analysis, which together produce BGRA element bitmaps with precise coordinates and visibility ordering. By combining chroma-key-based element extraction with DOM-aware layering, PixelWeb generates high-quality masks, contours, and BBoxes for over 100k pages, and manual verification confirms annotation fidelity. On GUI element detection, models trained on PixelWeb reach an mAP95 that is 3 to 7 times higher than with prior datasets, and user studies corroborate the gains in annotation quality. The dataset's rich element-level metadata supports a wide range of GUI tasks, from precise element retrieval and generation to layout and interaction modeling, with potential extension to mobile apps and non-web GUIs.
Abstract
Graphical User Interface (GUI) datasets are crucial for various downstream tasks. However, GUI datasets are often annotated by automatic labeling, which commonly yields inaccurate BBox annotations for GUI elements, including missing, duplicate, or meaningless BBoxes. These issues can degrade the performance of models trained on these datasets, limiting their effectiveness in real-world applications. Additionally, BBoxes are the only visual annotation that existing GUI datasets provide, which restricts the development of vision-related GUI downstream tasks. To address these issues, we introduce PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web pages. PixelWeb is constructed using a novel automatic annotation approach that integrates visual feature extraction and Document Object Model (DOM) structure analysis through two core modules: channel derivation and layer analysis. Channel derivation extracts BGRA four-channel bitmap annotations, ensuring accurate localization of GUI elements even when they are occluded or overlap. Layer analysis uses the DOM to determine the visibility and stacking order of elements, providing precise BBox annotations. PixelWeb also includes comprehensive metadata such as element images, contours, and mask annotations. Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb annotations. Experimental results on GUI element detection show that models trained on PixelWeb achieve an mAP95 that is 3 to 7 times higher than with existing datasets. We believe that PixelWeb has great potential to improve performance in downstream tasks such as GUI generation and automated user interaction.
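To make the idea of a BGRA element bitmap concrete, the following is a minimal sketch of how an alpha channel, and from it a tight BBox, could be recovered for a single element. It is not the authors' Channel Derivation procedure, which the abstract does not specify. As a stand-in it assumes the element can be rendered in isolation twice, once over a black and once over a white background (two-background matting, a relative of single-color chroma keying). The function names `derive_bgra` and `bbox_from_alpha` and the list-of-tuples image layout are hypothetical.

```python
from typing import List, Optional, Tuple

BGR = Tuple[int, int, int]            # one pixel, channels in 0..255
BGRA = Tuple[int, int, int, int]


def derive_bgra(on_black: List[List[BGR]],
                on_white: List[List[BGR]]) -> List[List[BGRA]]:
    """Recover per-pixel BGRA from two renderings of the same element.

    Assumption (not from the paper): standard alpha compositing, so
        C_black = a * F            and   C_white = a * F + (1 - a) * 255,
    which gives  C_white - C_black = (1 - a) * 255  in every channel.
    """
    result = []
    for row_b, row_w in zip(on_black, on_white):
        out_row = []
        for pb, pw in zip(row_b, row_w):
            # Average the per-channel difference to reduce rounding noise.
            diff = sum(w - b for b, w in zip(pb, pw)) / 3.0
            alpha = min(1.0, max(0.0, 1.0 - diff / 255.0))
            if alpha == 0.0:
                out_row.append((0, 0, 0, 0))      # fully transparent pixel
            else:
                # Undo the premultiplication to get the foreground color.
                fg = tuple(min(255, round(c / alpha)) for c in pb)
                out_row.append((*fg, round(alpha * 255)))
        result.append(out_row)
    return result


def bbox_from_alpha(bgra: List[List[BGRA]],
                    threshold: int = 0) -> Optional[Tuple[int, int, int, int]]:
    """Tight box (x_min, y_min, x_max, y_max), max exclusive, around
    pixels whose alpha exceeds `threshold`; None if nothing is visible."""
    xs, ys = [], []
    for y, row in enumerate(bgra):
        for x, px in enumerate(row):
            if px[3] > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
```

Because the box is computed from the recovered alpha channel rather than from the DOM's layout rectangle, it covers only the pixels the element actually paints. The same alpha channel doubles as the element mask.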
