SpiritSight Agent: Advanced GUI Agent with One Look

Zhiyuan Huang, Ziming Cheng, Junting Pan, Zhaohui Hou, Mingjie Zhan

TL;DR

SpiritSight tackles the core challenge of grounding accuracy in vision-based GUI agents by introducing GUI-Lasagne, a large-scale, multi-level pretraining dataset, and Universal Block Parsing (UBP) to resolve positional ambiguity in dynamic high-resolution inputs. The approach is trained end-to-end on GUI-Lasagne and shows state-of-the-art performance across web, mobile, and desktop benchmarks, with strong cross-platform and cross-language transfer. The work demonstrates that end-to-end vision-based GUI agents can rival multi-stage and language-driven methods while maintaining broad platform compatibility, though it acknowledges privacy-related limitations inherent to screenshot-based systems. Overall, SpiritSight advances GUI automation by combining scalable grounding data, robust spatial parsing, and efficient model tuning, enabling practical deployment in real-world GUI navigation tasks.

Abstract

Graphical User Interface (GUI) agents show strong abilities in assisting human-computer interaction by automating users' navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and compatibility across different GUI platforms. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). While they generally meet the requirements of compatibility and low latency, these vision-based GUI agents tend to have low accuracy due to their limitations in element grounding. To address this issue, we propose SpiritSight, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms. First, we create a multi-level, large-scale, high-quality GUI dataset called GUI-Lasagne using scalable methods, empowering SpiritSight with robust GUI understanding and grounding capabilities. Second, we introduce the Universal Block Parsing (UBP) method to resolve the ambiguity problem in dynamic high-resolution visual inputs, further enhancing SpiritSight's ability to ground GUI objects. Through these efforts, the SpiritSight agent outperforms other advanced methods on diverse GUI benchmarks, demonstrating its superior capability and compatibility in GUI navigation tasks. Models and datasets are available at https://hzhiyuan.github.io/SpiritSight-Agent.

Paper Structure

This paper contains 44 sections, 10 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Our SpiritSight agent achieves new state-of-the-art (SOTA) performance across various benchmarks in web, mobile, and desktop scenarios.
  • Figure 2: Comparison of the average step success rate on the Multimodal-Mind2Web benchmark between our SpiritSight agent at three sizes (2B, 8B, 26B) and various previous methods.
  • Figure 3: The overview of our SpiritSight agent. We develop a large-scale, multi-level, high-quality pre-training dataset that equips SpiritSight with three levels of comprehensive GUI knowledge. Additionally, we introduce a Universal Block Parsing (UBP) method to enhance SpiritSight's grounding capabilities.
  • Figure 4: The collection pipeline of our GUI-Lasagne dataset. The left, middle, and right parts show the construction of level-1, level-2, and level-3 data, respectively.
  • Figure 5: (a) Comparison between baseline block parsing and our proposed UBP. (b) Results of baseline block parsing and our proposed UBP method on the Multimodal-Mind2Web benchmark. UBP improves the performance of our model, and the combination of UBP and 2D-BPE achieves the best results.
  • ...and 6 more figures