SpiritSight Agent: Advanced GUI Agent with One Look
Zhiyuan Huang, Ziming Cheng, Junting Pan, Zhaohui Hou, Mingjie Zhan
TL;DR
SpiritSight tackles the core challenge of grounding accuracy in vision-based GUI agents by introducing GUI-Lasagne, a large-scale, multi-level pretraining dataset, and Universal Block Parsing (UBP), a method that resolves positional ambiguity in dynamic high-resolution inputs. The model is trained end-to-end on GUI-Lasagne and achieves state-of-the-art performance across web, mobile, and desktop benchmarks, with strong cross-platform and cross-language transfer. The work demonstrates that end-to-end vision-based GUI agents can rival multi-stage and language-driven methods while maintaining broad platform compatibility, though it acknowledges privacy-related limitations inherent to screenshot-based systems. Overall, SpiritSight advances GUI automation by combining scalable grounding data, robust spatial parsing, and efficient model tuning, enabling practical deployment in real-world GUI navigation tasks.
Abstract
Graphical User Interface (GUI) agents show impressive abilities in assisting human-computer interaction by automating users' navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and compatibility with different GUI platforms. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). While they generally meet the requirements of compatibility and low latency, these vision-based GUI agents tend to have low accuracy due to their limitations in element grounding. To address this issue, we propose $\textbf{SpiritSight}$, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms. First, we create a multi-level, large-scale, high-quality GUI dataset called $\textbf{GUI-Lasagne}$ using scalable methods, empowering SpiritSight with robust GUI understanding and grounding capabilities. Second, we introduce the $\textbf{Universal Block Parsing (UBP)}$ method to resolve the ambiguity problem in dynamic high-resolution visual inputs, further enhancing SpiritSight's ability to ground GUI objects. Through these efforts, the SpiritSight agent outperforms other advanced methods on diverse GUI benchmarks, demonstrating its superior capability and compatibility in GUI navigation tasks. Models and datasets are available at https://hzhiyuan.github.io/SpiritSight-Agent.
