AppVLM: A Lightweight Vision Language Model for Online App Control
Georgios Papoudakis, Thomas Coste, Zhihao Wu, Jianye Hao, Jun Wang, Kun Shao
TL;DR
This work addresses the challenge of efficient and generalizable smartphone app control by introducing AppVLM, a lightweight 3B-parameter vision-language model trained to execute human instructions on Android devices. The method combines offline supervised fine-tuning on AndroidControl with Reinforce Fine-Tuning in a distributed AndroidWorld environment, alternating between online data collection and offline policy improvement so that the policy extends beyond its original training distribution. Empirically, AppVLM achieves the highest action prediction accuracy among the evaluated baselines on AndroidControl and an online task completion rate in AndroidWorld comparable to GPT-4o, while running inference up to ten times faster and at lower compute cost. The results point to a practical, scalable path toward robust app automation. The authors acknowledge limitations in training data coverage, and note that standardized app-control datasets and reward models are needed before RL can be applied more broadly.
Abstract
The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight Vision-Language Model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results indicate that AppVLM achieves the highest action prediction accuracy in offline evaluation on the AndroidControl dataset, compared to all evaluated baselines, and matches GPT-4o in online task completion success rate in the AndroidWorld environment, while being up to ten times faster. This makes AppVLM a practical and efficient solution for real-world deployment.
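The two-stage recipe described above (offline fine-tuning, then repeated rounds of online data collection and further training) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the policy is reduced to a per-task success probability, `run_episode` and `fine_tune` are hypothetical stand-ins for an AndroidWorld rollout and an offline training step, and keeping only successful trajectories is an assumed filtering rule that the abstract does not specify.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    task: str
    actions: list
    success: bool


def run_episode(policy, task, rng):
    # Hypothetical stand-in for an online rollout: the episode succeeds
    # with the policy's current success probability for this task.
    success = rng.random() < policy[task]
    return Trajectory(task=task, actions=["tap", "type"], success=success)


def fine_tune(policy, dataset):
    # Hypothetical stand-in for offline training: every stored trajectory
    # nudges the success probability of its task upward, capped at 1.0.
    updated = dict(policy)  # copy so the caller's policy is not mutated
    for traj in dataset:
        updated[traj.task] = min(1.0, updated[traj.task] + 0.1)
    return updated


def iterate(policy, tasks, num_iterations, seed=0):
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_iterations):
        # 1. Collect data online with the current policy.
        rollouts = [run_episode(policy, task, rng) for task in tasks]
        # 2. Keep only successful trajectories (assumed filtering rule).
        dataset.extend(t for t in rollouts if t.success)
        # 3. Improve the policy offline on the accumulated data.
        policy = fine_tune(policy, dataset)
    return policy, dataset


# Toy starting point, standing in for the policy after offline fine-tuning.
initial_policy = {"set_alarm": 0.5, "send_message": 0.2}
final_policy, dataset = iterate(initial_policy, list(initial_policy), num_iterations=5)
```

In this toy setting the per-task success probabilities can only stay the same or rise between rounds, which mirrors the intent of refining the policy on data it gathers itself. A real system would replace both stand-ins with actual environment rollouts and gradient-based training.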
