AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation
Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, Yuanchun Li
TL;DR
AutoDroid-V2 reframes mobile GUI task automation as on-device code generation guided by a fine-grained app document, enabling script-based execution with strong privacy and efficiency benefits. It introduces offline app document generation (state grouping, abstract elements, and ETG dependencies) and online data synthesis plus a runtime interpreter with dependency-aware execution and prompt caching. Through extensive experiments on DroidTask and AitW-subset, it demonstrates higher task success rates, substantially lower latency, and reduced token costs compared with prior step-wise and on-device baselines, all while maintaining robust performance across model sizes. The work highlights the potential of document-guided scripting for scalable, on-device GUI agents with practical privacy and cost advantages, and discusses limitations and avenues for future integration with step-wise reasoning and vision-based grounding.
Abstract
Large language models (LLMs) have brought exciting new advances to mobile UI agents, a long-standing research field that aims to complete arbitrary natural language tasks through mobile UI interactions. However, existing UI agents usually demand powerful large language models that are difficult to deploy locally on end-users' devices, raising serious concerns about user privacy and centralized serving cost. Inspired by the remarkable coding abilities of recent small language models (SLMs), we propose to convert the UI task automation problem into a code generation problem, which can be effectively solved by an on-device SLM and efficiently executed with an on-device code interpreter. Unlike normal coding tasks, for which models can be extensively pre-trained on public datasets, generating UI automation code is challenging due to the diversity, complexity, and variability of target apps. Therefore, we adopt a document-centered approach that automatically builds fine-grained API documentation for each app and generates diverse task samples based on this documentation. Guided by the synthetic documents and task samples, the agent learns to generate precise and efficient scripts to complete unseen tasks. In detailed comparisons with state-of-the-art mobile UI agents, our approach improves mobile task automation, achieving significantly higher success rates and lower latency and token consumption. Code is open-sourced at https://github.com/MobileLLM/AutoDroid-V2.
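To make the idea of script-based execution concrete, the following is a minimal toy sketch, not the AutoDroid-V2 implementation. It assumes a hypothetical app document that maps abstract element names to UI states, a hypothetical transition graph between states, and a tiny interpreter that navigates to the required state before acting on an element. All names (`ELEMENT_STATE`, `TRANSITIONS`, `ToyInterpreter`, `run_script`, and the element names) are invented for illustration.

```python
from collections import deque

# Hypothetical app document: each abstract element name maps to the UI state
# that contains it. These names are illustrative assumptions only.
ELEMENT_STATE = {
    "open_menu_button": "home",
    "settings_item": "menu",
    "dark_mode_switch": "settings",
}

# Hypothetical transition graph: state -> list of (element, next_state).
TRANSITIONS = {
    "home": [("open_menu_button", "menu")],
    "menu": [("settings_item", "settings")],
    "settings": [],
}


def path_to_state(start, goal):
    """Breadth-first search for the shortest list of taps from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, taps = queue.popleft()
        if state == goal:
            return taps
        for element, nxt in TRANSITIONS.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, taps + [element]))
    return None  # goal is unreachable from start


class ToyInterpreter:
    """Simulates a device: tracks the current state and logs every tap."""

    def __init__(self, start_state="home"):
        self.state = start_state
        self.log = []

    def _raw_tap(self, element):
        # Record the tap and follow the transition it triggers, if any.
        self.log.append(element)
        for el, nxt in TRANSITIONS.get(self.state, []):
            if el == element:
                self.state = nxt
                break

    def tap(self, element):
        # Dependency-aware step: navigate to the element's state first.
        target = ELEMENT_STATE[element]
        if target != self.state:
            taps = path_to_state(self.state, target)
            if taps is None:
                raise RuntimeError(f"cannot reach {element!r} from {self.state!r}")
            for nav in taps:
                self._raw_tap(nav)
        self._raw_tap(element)


def run_script(script, interpreter):
    # Execute a generated script with only `tap` exposed by name.
    # A real system would need proper sandboxing; exec is used here for brevity.
    exec(script, {"tap": interpreter.tap})
```

In this sketch, a one-line script such as `tap("dark_mode_switch")` run from the `home` state results in three taps (`open_menu_button`, `settings_item`, `dark_mode_switch`), because the interpreter fills in the navigation steps from the transition graph. The script itself only needs to name the element it ultimately cares about.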
