Lightweight Neural App Control
Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
TL;DR
LiMAC tackles efficient on-device mobile app control by pairing a compact Action Transformer (AcT) with a small, fine-tuned vision-language model (VLM) that handles the actions requiring text generation. The framework uses a contrastive objective to predict click targets and relies on the fine-tuned VLM for open-app and input-text actions, achieving substantial speedups and accuracy gains over GPT-4o baselines and larger VLMs. On AndroidControl and Android-in-the-Wild, LiMAC improves action accuracy by up to 19% over fine-tuned VLMs and by up to 42% over prompt-engineered baselines, while enabling on-device inference that is up to 30x faster. The work also provides a modular design that accommodates different module pairings, highlights the importance of visual features and CLIP fine-tuning, and outlines future directions in online learning and safety considerations for mobile agents.
Abstract
This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
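The division of labour summarised above (a click target chosen by matching against UI elements, and text-bearing actions delegated to the VLM) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that an action type has already been predicted, that a query vector and one embedding per UI element are available, and that cosine similarity is the scoring function; the summary only states that a contrastive objective is used. The names `Action`, `pick_click_target`, `route_action`, and `generate_text` are hypothetical, as is the `scroll-down` action type in the usage note.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Action types that need generated text (named in the summary above).
TEXT_ACTIONS = {"open-app", "input-text"}


@dataclass
class Action:
    kind: str
    target_index: Optional[int] = None  # index of the clicked UI element
    text: Optional[str] = None          # app name or text to type


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_click_target(query: Sequence[float],
                      element_embeddings: Sequence[Sequence[float]]) -> int:
    """Return the index of the UI element most similar to the query.

    Assumption: similarity-based selection stands in for the model
    trained with a contrastive objective.
    """
    if not element_embeddings:
        raise ValueError("no UI elements to click")
    scores = [cosine(query, emb) for emb in element_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


def route_action(action_type: str,
                 query: Sequence[float],
                 element_embeddings: Sequence[Sequence[float]],
                 generate_text: Callable[[str], str]) -> Action:
    """Dispatch on the predicted action type.

    `generate_text` is a hypothetical stand-in for the fine-tuned VLM.
    """
    if action_type == "click":
        return Action("click",
                      target_index=pick_click_target(query, element_embeddings))
    if action_type in TEXT_ACTIONS:
        return Action(action_type, text=generate_text(action_type))
    # Other action types need neither a target nor generated text.
    return Action(action_type)
```

For example, `route_action("click", [0.1, 0.9], [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], vlm)` selects the element at index 1, while `route_action("input-text", ...)` calls the text generator and never scores the UI elements. Keeping the routing separate from the text generator is one way to reflect the modular design mentioned in the summary, since either side can be swapped without touching the other.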
