Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu

TL;DR

V-Droid reframes mobile GUI task automation by using LLMs as verifiers that evaluate candidate actions, converting decision-making into batched, prefilling-only verification over a discretized action space. It introduces pairwise progress preference ($P^3$) training and a scalable human–agent joint annotation scheme to train a verifier based on Llama-3.1-8B. Evaluations on AndroidWorld, AndroidLab, and MobileAgentBench show task success rates of 59.5%, 38.3%, and 49%, respectively, with a per-step latency of 4.3 seconds; these results surpass prior agents by 5.2%, 2.1%, and 9%. The work demonstrates near-real-time, verifier-driven decision-making and provides a scalable framework for deploying mobile GUI agents.

Abstract

We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: the discretized action space construction coupled with the prefilling-only workflow to accelerate the verification process, the pair-wise progress preference training to significantly enhance the verifier's decision-making capabilities, and the scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid obtains a substantial task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 5.2%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves a remarkably low latency of 4.3s per step, which is 6.1x faster compared with existing mobile agents. The source code is available at https://github.com/V-Droid-Agent/V-Droid.
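The verifier-driven decision step described in the abstract can be illustrated with a minimal Python sketch. It is not the V-Droid implementation: the `Action` type, the three default actions, the prompt layout, and the word-overlap scorer `toy_score_batch` are all hypothetical stand-ins chosen for illustration. In the paper the scores come from a trained LLM verifier; here a toy function plays that role so the control flow (enumerate candidates, build one prompt per candidate, score the batch, pick the best) can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Action:
    kind: str         # e.g. "click" (hypothetical action vocabulary)
    target: str = ""  # UI element label; empty for default actions


# Hypothetical default actions; the source only says defaults are supplemented.
DEFAULT_ACTIONS = [Action("navigate_back"), Action("navigate_home"), Action("wait")]


def build_candidates(ui_elements: Sequence[str]) -> List[Action]:
    """Discretize the action space: one click per UI element, plus defaults."""
    return [Action("click", label) for label in ui_elements] + DEFAULT_ACTIONS


def build_prompt(task: str, history: Sequence[str], action: Action) -> str:
    """Build one verification prompt; all candidates share the same prefix."""
    prefix = f"Task: {task}\nHistory: {'; '.join(history) or 'none'}\n"
    return prefix + f"Candidate action: {action.kind} {action.target}".rstrip()


def select_action(
    task: str,
    ui_elements: Sequence[str],
    history: Sequence[str],
    score_batch: Callable[[List[str]], List[float]],
) -> Action:
    """Score every candidate in one batch and return the highest-scoring one."""
    candidates = build_candidates(ui_elements)
    prompts = [build_prompt(task, history, a) for a in candidates]
    scores = score_batch(prompts)
    if len(scores) != len(candidates):
        raise ValueError("verifier must return one score per candidate")
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


def toy_score_batch(prompts: List[str]) -> List[float]:
    """Stand-in for the LLM verifier: counts words shared by task and candidate."""
    scores = []
    for prompt in prompts:
        prefix, candidate = prompt.rsplit("Candidate action: ", 1)
        task_words = set(prefix.splitlines()[0].lower().split())
        scores.append(float(len(task_words & set(candidate.lower().split()))))
    return scores


if __name__ == "__main__":
    chosen = select_action(
        "Open the Settings app", ["Camera", "Settings", "Clock"], [], toy_score_batch
    )
    print(chosen)
```

Because every prompt shares the task and history prefix and differs only in the candidate suffix, a real verifier could score the whole batch in a single prefill pass with the prefix cached, which is the property the abstract credits for the low per-step latency.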

Paper Structure

This paper contains 29 sections, 7 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Task success rate and latency per step of current mobile agents and V-Droid, evaluated on the AndroidWorld benchmark. The latency of 2B, 7B, and 8B agents is measured on $2\times$ Nvidia 4090. For 72B or MoE agents, the latency is measured on $4\times$ Nvidia A100 80G.
  • Figure 2: The key differences in agent architecture between using LLMs as generators and as verifiers for decision-making: rather than directly determining actions based on states, verifier-driven agents explicitly evaluate each action before arriving at the decision.
  • Figure 3: The distribution of interactive UI elements within each UI page, obtained by analyzing around $25,000$ real-world UI screens from the public dataset li2024effects.
  • Figure 4: The Workflow of V-Droid: ① Extracting actions from UI and supplementing default actions; ② Constructing verification prompts with the template for candidate actions; ③ Scoring with the verifier in batch with prefix caching; ④ Completing and executing the selected action; ⑤ Updating the working memory.
  • Figure 5: Illustration of $P^3$ training used in V-Droid.
  • ...and 9 more figures