Lightweight Neural App Control

Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao

TL;DR

LiMAC tackles efficient on-device mobile app control by integrating a compact Action Transformer (AcT) with a small, fine-tuned vision-language model to handle text generation tasks. The framework uses a contrastive objective to predict click targets and relies on a fine-tuned VLM for open-app and input-text actions, achieving substantial speedups and accuracy gains over GPT-4o baselines and larger VLMs. Across AndroidControl and Android-in-the-Wild, LiMAC demonstrates up to 19% improvement in action accuracy versus fine-tuned VLMs and up to 42% against prompt-engineered baselines, while enabling on-device inference that is up to 30x faster. The work also provides a modular design that accommodates different module pairings, highlights the importance of visual features and CLIP fine-tuning, and outlines future directions in online learning and safety considerations for mobile agents.

Abstract

This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
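The TL;DR mentions that click targets are predicted with a contrastive objective. The sketch below illustrates one common form of such an objective: a query vector (standing in for the transformer's output) is scored against each UI element embedding, and a softmax over the scores gives a distribution over candidate targets. The cosine similarity, the temperature value, the InfoNCE-style loss, and all function names are assumptions made for illustration; this excerpt does not specify the exact formulation used in LiMAC.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def click_target_probs(query, element_embeddings, temperature=0.1):
    """Softmax over temperature-scaled similarities between the query and each element.

    `temperature` is a hypothetical hyperparameter, not a value taken from the paper.
    """
    logits = [cosine(query, e) / temperature for e in element_embeddings]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [x / total for x in exps]

def contrastive_loss(query, element_embeddings, target_index, temperature=0.1):
    """InfoNCE-style loss (assumed form): negative log-probability of the true target."""
    probs = click_target_probs(query, element_embeddings, temperature)
    return -math.log(probs[target_index])

def predict_click_target(query, element_embeddings):
    """At inference time, pick the index of the most similar UI element."""
    sims = [cosine(query, e) for e in element_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy example: three UI elements, the second is closest to the query.
query = [1.0, 0.0]
elements = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
predicted = predict_click_target(query, elements)
```

Under this assumed formulation, training lowers the loss by pulling the query towards the embedding of the element that was actually clicked and away from the other elements on the same screen.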

Paper Structure

This paper contains 27 sections, 6 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Illustration of AcT. Each UI element is encoded separately into a vector $e_{t,i}$ using pretrained embedding models. The embeddings are then fed into the transformer's input sequence $x_t$ along with those of the previous timesteps in that episode. The transformer's prediction is decoded to produce the next action, which consists of $a^{\text{type}}_t$ and $a^{\text{spec}}_t$.
  • Figure 2: The architecture of LiMAC. The observation-action history $\{o_{t}, a_{t-1}, o_{t-1}, \dots\}$ and the goal $g$ are processed into a vector $x$ and passed to AcT. The image observation $o^{\text{img}}_t$ with the bounding boxes and the goal $g$ are passed as inputs to the VLM. The VLM is called only if the action type output of AcT selects an action that requires text completion. The final action is selected following the protocol described in the sections on action type prediction, LLM fine-tuning, and predicting click targets.
  • Figure 3: Confusion matrix for action type selection for LiMAC in AndroidControl.
  • Figure 4: Relative frequency of different types of action prediction errors in the two datasets.
  • Figure 5: Number of successful and failed action predictions with respect to the number of UI elements in the observation, for the two datasets.
  • ...and 2 more figures
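
The routing described in the Figure 2 caption, where the VLM is called only when the predicted action type requires text, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the `Action` class, the `select_action` function, the callables `click_head` and `vlm`, and the action type strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union

# Action types that need generated text, per the TL;DR (the string names are assumed).
TEXT_ACTION_TYPES = {"input-text", "open-app"}

@dataclass
class Action:
    action_type: str
    spec: Optional[Union[int, str]] = None  # click target index, generated text, or nothing

def select_action(
    action_type: str,
    click_head: Callable[[], int],
    vlm: Callable[[str], str],
) -> Action:
    """Route on the action type predicted by the lightweight model.

    `click_head` stands in for the click-target predictor and `vlm` for the
    fine-tuned vision-language model; both are placeholder callables.
    """
    if action_type in TEXT_ACTION_TYPES:
        # Only this branch pays the cost of a VLM call.
        return Action(action_type, vlm(action_type))
    if action_type == "click":
        return Action(action_type, click_head())
    # Any other action type is assumed here to need no further specification.
    return Action(action_type)

# Usage with stand-in components that record whether the VLM was invoked.
vlm_calls = []

def fake_vlm(action_type: str) -> str:
    vlm_calls.append(action_type)
    return f"text for {action_type}"

click_action = select_action("click", click_head=lambda: 3, vlm=fake_vlm)
text_action = select_action("input-text", click_head=lambda: 3, vlm=fake_vlm)
```

The point of this design is that the expensive model sits behind a cheap gate: in the sketch, a click is resolved without invoking `fake_vlm` at all, which is consistent with the on-device speedups the TL;DR attributes to the architecture.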