
Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, Peter Staar

TL;DR

This work tackles the patchy perception problem in GUI-enabled agents by introducing ScreenParse, a large-scale, densely annotated dataset that captures the complete on-screen UI structure, and ScreenVLM, a lightweight vision-language model that outputs a compact ScreenTag representation. The authors present Webshot, an automated pipeline that renders diverse web pages, aligns DOM elements with visual content, and refines annotations with VLM guidance to produce high-quality dense labels across 55 UI classes. Dense screen supervision enables ScreenVLM to outperform much larger foundation VLMs on in-domain dense parsing and to transfer structural priors to public UI benchmarks; fine-tuning foundation VLMs on ScreenParse similarly yields robust gains, underscoring the generality of the approach. Together, the dataset, model, and structure-aware training objective offer a practical path toward efficient, on-device UI understanding and improved grounding for computer-use agents.

Abstract

Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision-language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, fine-tuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.
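
The abstract mentions a structure-aware loss that upweights structure-critical tokens but does not give its exact form here. As a rough illustration only, one common way to realize such an idea is a per-token weighted cross-entropy. The sketch below is an assumption, not the authors' implementation: the token ids in `STRUCTURAL_IDS`, the weight `STRUCT_WEIGHT`, and both function names are hypothetical.

```python
import math

# Hypothetical setup: ids 0 and 1 stand in for structural markup tokens
# (e.g. tag or coordinate tokens); every other id is ordinary text.
STRUCTURAL_IDS = {0, 1}
STRUCT_WEIGHT = 2.0  # assumed upweighting factor, not taken from the paper


def token_weights(targets):
    """Assign a larger weight to positions whose target is a structural token."""
    return [STRUCT_WEIGHT if t in STRUCTURAL_IDS else 1.0 for t in targets]


def weighted_token_loss(logits, targets, weights):
    """Weighted mean negative log-likelihood over a token sequence.

    logits:  list of per-position score lists (one score per vocabulary id)
    targets: list of target token ids
    weights: list of per-position weights
    """
    total, norm = 0.0, 0.0
    for scores, target, w in zip(logits, targets, weights):
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += w * (log_z - scores[target])
        norm += w
    return total / norm


# Position 0 (structural) is predicted poorly, position 1 (text) is predicted well.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 100.0]]
targets = [0, 2]
plain = weighted_token_loss(logits, targets, [1.0, 1.0])
upweighted = weighted_token_loss(logits, targets, token_weights(targets))
print(plain, upweighted)
```

With these toy inputs the upweighted loss is larger than the plain mean, because the error on the structural position counts twice as much; in training, that would push gradient toward getting the markup skeleton right.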

Paper Structure (65 sections, 8 equations, 15 figures, 9 tables)

Figures (15)

  • Figure 1: Class distribution of the top-20 most frequent UI elements in the ScreenParse dataset.
  • Figure 2: Qualitative example from ScreenParse illustrating dense, complete UI annotations visualized as labeled bounding boxes.
  • Figure 3: Overview of the Webshot dataset generation pipeline. Our scalable framework renders diverse URLs with Playwright and extracts DOM-driven dense annotations. VLMs further refine UI element types and filter low-quality samples.
  • Figure 4: Overview of the ScreenVLM architecture. A screenshot is encoded by the SigLIP-2 vision encoder [tschannen2025siglip] into patch tokens, which are projected and fed to the Granite-165M LLM decoder [granitecodemodels] together with text tokens to generate the ScreenTag sequence.
  • Figure 5: Training/Validation loss and accuracy curves for the YOLO component.
  • ...and 10 more figures