ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction
Yiqiao Jin, Stefano Petrangeli, Yu Shen, Gang Wu
TL;DR
The paper addresses two obstacles to training GUI agents, sparse supervision and limited scalability, by introducing a stateful screen schema that encodes dynamic GUI interactions over time. Building on this, ScreenLLM combines a stateful schema generator, a memory module, and multimodal LLMs to perform UI understanding and action prediction, using techniques such as key-frame extraction via second-order pixel changes, OCR-based UI element detection, and cursor localization. Experiments on open-source and proprietary backbones show substantial gains in action understanding and future-action prediction, including higher BLEU-2 and ROUGE-L scores with LLaVA-13B and higher BLEU-2 and CIDEr-D scores with GPT-4o. The work provides a scalable, robust framework for GUI agents applicable across diverse software environments, and outlines avenues for real-time action generation and privacy-conscious data usage.
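To make "key-frame extraction via second-order pixel changes" concrete, the following is a minimal sketch of one plausible reading of the idea, not the authors' implementation. It assumes a screen recording is available as a NumPy array of frames; the function name `key_frame_indices`, the mean-absolute-difference measure, and the `threshold` parameter are all illustrative assumptions. The first-order change is the average pixel difference between consecutive frames, and the second-order change is how much that difference itself varies, so frames are flagged where screen activity starts or stops rather than wherever pixels merely differ.

```python
import numpy as np


def key_frame_indices(frames, threshold=1.0):
    """Return indices of frames where the rate of pixel change shifts sharply.

    frames: array-like of shape (T, H, W) or (T, H, W, C).
    threshold: hypothetical cutoff on the second-order change (an assumption,
               not a value taken from the paper).
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    if num_frames < 3:
        # A second-order difference needs at least three frames.
        return []

    # First-order change: mean absolute pixel difference per transition.
    # first[i] describes the transition from frame i to frame i + 1.
    first = np.abs(np.diff(frames, axis=0)).reshape(num_frames - 1, -1).mean(axis=1)

    # Second-order change: how much the first-order change itself varies.
    # second[i] compares transition (i -> i+1) with transition (i+1 -> i+2),
    # so it is attributed to the shared frame i + 1.
    second = np.abs(np.diff(first))

    return [int(i) + 1 for i in np.nonzero(second > threshold)[0]]
```

With this attribution, an abrupt screen update flags both the last frame before the change and the first frame after it, which is one way to capture the before and after states of a user action. A real pipeline would likely add smoothing and de-duplication of neighboring key frames.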
Abstract
Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability for large datasets, and the need for nuanced user understanding. We propose the stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.
