AppVLM: A Lightweight Vision Language Model for Online App Control

Georgios Papoudakis, Thomas Coste, Zhihao Wu, Jianye Hao, Jun Wang, Kun Shao

TL;DR

This work addresses the challenge of efficient and generalizable smartphone app control by introducing AppVLM, a lightweight 3B vision-language model trained to execute human instructions on Android devices. The method combines offline supervised fine-tuning on AndroidControl with Reinforce Fine-Tuning (RFT) in a distributed AndroidWorld environment, alternating data collection with offline policy improvement to extend beyond the training distribution. Empirically, AppVLM achieves state-of-the-art action prediction accuracy on AndroidControl and an online task completion rate in AndroidWorld comparable to GPT-4o, while running up to ten times faster at lower compute cost. The results point to a practical, scalable path toward robust app automation. The authors acknowledge limitations related to training data coverage and note the need for standardized app-control datasets and reward models for broader RL applicability.

Abstract

The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight Vision-Language Model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results indicate that AppVLM achieves the highest action prediction accuracy in offline evaluation on the AndroidControl dataset, compared to all evaluated baselines, and matches GPT-4o in online task completion success rate in the AndroidWorld environment, while being up to ten times faster. This makes AppVLM a practical and efficient solution for real-world deployment.
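The two-stage recipe described above (offline fine-tuning, then repeated rounds of data collection and further training) can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the toy tasks, the `observe`, `rollout`, `fine_tune` and `iterate` helpers, the lookup-table "policy", and the choice to keep only successful episodes are hypothetical stand-ins, not the paper's model, environment, data preprocessing, or training objective.

```python
import random
from typing import Dict, List, Tuple

ACTIONS = ["click", "type", "long_press", "scroll"]

# Hypothetical toy tasks: each goal is solved by one fixed action sequence.
TASKS: Dict[str, List[str]] = {
    "record_audio": ["click", "long_press", "type"],
    "add_contact": ["click", "type", "type"],
}

Step = Tuple[str, str]  # (observation, action)


def observe(goal: str, history: List[str]) -> str:
    """Toy observation: the goal plus the history of actions taken so far."""
    return f"{goal}|{','.join(history)}"


def rollout(policy: Dict[str, str], goal: str,
            rng: random.Random) -> Tuple[List[Step], bool]:
    """Run one episode; it succeeds only if every action matches the task."""
    history: List[str] = []
    steps: List[Step] = []
    for _ in TASKS[goal]:
        obs = observe(goal, history)
        # Use the learned action if there is one, otherwise explore at random.
        action = policy.get(obs) or rng.choice(ACTIONS)
        steps.append((obs, action))
        history.append(action)
    return steps, history == TASKS[goal]


def fine_tune(dataset: List[Step]) -> Dict[str, str]:
    """Stand-in for a fine-tuning step: memorise the actions in the dataset."""
    return dict(dataset)


def iterate(offline_data: List[Step], iterations: int = 5,
            rollouts_per_task: int = 50, seed: int = 0) -> Dict[str, str]:
    rng = random.Random(seed)
    dataset = list(offline_data)
    policy = fine_tune(dataset)              # stage 1: offline data only
    for _ in range(iterations):              # stage 2: online refinement
        for goal in TASKS:
            for _ in range(rollouts_per_task):
                steps, success = rollout(policy, goal, rng)
                if success:                  # assumption: keep successes only
                    dataset.extend(steps)
        policy = fine_tune(dataset)          # retrain on the grown dataset
    return policy


def success_rate(policy: Dict[str, str], episodes: int = 20,
                 seed: int = 1) -> float:
    rng = random.Random(seed)
    results = [rollout(policy, goal, rng)[1]
               for goal in TASKS for _ in range(episodes)]
    return sum(results) / len(results)


if __name__ == "__main__":
    # Hypothetical offline data that covers only the first step of each task.
    offline = [("record_audio|", "click"), ("add_contact|", "click")]
    print("offline only:", success_rate(fine_tune(offline)))
    print("after iterations:", success_rate(iterate(offline)))
```

In this toy setting the offline data covers only part of each task, and the collection rounds fill in the remaining steps from episodes that happened to succeed. That mirrors, in a very loose way, the idea of using environment interaction to reach behaviour the offline dataset does not contain.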

Paper Structure

This paper contains 26 sections, 2 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Visualisation of the RFT pipeline. Data is gathered through interactions between the emulators and AppVLM, then preprocessed and added to the dataset, which is used to perform a fine-tuning step.
  • Figure 2: Example trajectory in AndroidWorld, with the goal at the top and the taken actions below each timestep's screenshot. The agent almost succeeds in solving this task, but forgets to clear the text field before typing in the penultimate step.
  • Figure 3: Example AndroidWorld observation passed as input to AppVLM. The visual input is composed of the current screenshot, annotated with bounding boxes surrounding clickable UI elements, along with numbered labels. The textual input is composed of the task goal, as well as the history of actions. This observation corresponds to the input for step 2 in Figure 5.
  • Figure 4: Example trajectory in AndroidWorld, with the goal at the top and the taken actions below each timestep's screenshot. AppVLM successfully creates an audio recording and saves it with the appropriate filename. Step 6 is noteworthy, with the agent opting for a long-press action, which is very rare in the initial AndroidControl dataset. This trajectory contrasts directly with the near-miss shown in Figure 2.
  • Figure 5: Example trajectory in AndroidWorld, with the goal at the top and the taken actions below each timestep's screenshot. AppVLM successfully creates a new contact, filling out several form fields to do so.
  • ...and 2 more figures