Do Multi-Agents Dream of Electric Screens? Achieving Perfect Accuracy on AndroidWorld Through Task Decomposition

Pierre-Louis Favreau; Jean-Pierre Lo; Clement Guiguet; Charles Simon-Meunier; Nicolas Dehandschoewercker; Allen G. Roush; Judah Goldfeder; Ravid Shwartz-Ziv

TL;DR

The paper addresses the challenge of fully automating mobile UI tasks on AndroidWorld and introduces Minitap, a six-agent system (Planner, Orchestrator, Contextor, Cortex, Executor, Screen Analyzer) plus a Summarizer, augmented by verified execution and meta-cognitive reasoning. By decomposing cognition across specialized agents and enforcing deterministic post-validation, Minitap overcomes historical failure modes related to context management, execution reliability, and error recovery, achieving $100\%$ on $116$ tasks across $20$ apps and surpassing the human baseline of $80\%$ by $20$ points. Ablation studies quantify the contribution of each component: multi-agent decomposition contributes $+21$ points, verified execution $+7$ points, and meta-cognitive reasoning $+9$ points, indicating that all three are needed for peak performance. The work also analyzes cost-efficiency through a Pareto frontier, showing that reserving a frontier model for the reasoning-critical Cortex and using budget models elsewhere matches the all-frontier success rate at $32\%$ lower cost per task. It closes by discussing broader implications for modular, verified agent design in mobile automation, with an open-source release for reproducibility and further research.

Abstract

We present Minitap, a multi-agent system that achieves 100% success on the AndroidWorld benchmark, the first to fully solve all 116 tasks and surpassing human performance (80%). We first analyze why single-agent architectures fail: context pollution from mixed reasoning traces, silent text input failures undetected by the agent, and repetitive action loops without escape. Minitap addresses each failure through targeted mechanisms: cognitive separation across six specialized agents, deterministic post-validation of text input against device state, and meta-cognitive reasoning that detects cycles and triggers strategy changes. Ablations show multi-agent decomposition contributes +21 points over single-agent baselines; verified execution adds +7 points; meta-cognition adds +9 points. We release Minitap as open-source software. https://github.com/minitap-ai/mobile-use
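The two deterministic mechanisms named in the abstract, post-validation of text input and cycle detection, can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the Minitap codebase: the names `verified_type`, `CycleDetector`, and `FlakyField`, the retry count, and the window sizes are assumptions chosen for illustration, and the sketch assumes each retry replaces the field content rather than appending to it.

```python
from collections import deque
from typing import Callable, Deque, Tuple


def verified_type(
    send_text: Callable[[str], None],
    read_field: Callable[[], str],
    text: str,
    max_attempts: int = 3,  # assumed retry budget
) -> bool:
    """Type `text`, then confirm it by reading the field back from device state."""
    for _ in range(max_attempts):
        send_text(text)
        if read_field() == text:  # deterministic check, no model call involved
            return True
    return False


class CycleDetector:
    """Flags when the same (screen, action) pair keeps recurring in a recent window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        self.history: Deque[Tuple[str, str]] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, screen_hash: str, action: str) -> bool:
        """Record a step; return True if a strategy change should be triggered."""
        step = (screen_hash, action)
        self.history.append(step)
        return self.history.count(step) >= self.max_repeats


class FlakyField:
    """Stand-in for a device text field that silently drops the first input."""

    def __init__(self) -> None:
        self.value = ""
        self.calls = 0

    def send(self, text: str) -> None:
        self.calls += 1
        if self.calls > 1:  # the first call is lost, as in a silent input failure
            self.value = text

    def read(self) -> str:
        return self.value


field = FlakyField()
typed_ok = verified_type(field.send, field.read, "hello")

detector = CycleDetector()
flags = [detector.record("screen-a", "tap:OK") for _ in range(3)]
```

In this sketch the silent failure is caught because the field is read back after every attempt, and the third identical step on an unchanged screen raises the flag that would prompt a change of strategy.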

Paper Structure (37 sections, 3 figures, 6 tables)

Introduction
Related Work
Problem Setting
The AndroidWorld Benchmark
Failure Mode Analysis
System Architecture
Multi-Agent Decomposition
Agent Specifications
Utility Agents
Execution Flow
Verified Execution
Platform Automation
Meta-Cognitive Reasoning
Supporting Mechanisms
Evaluation Infrastructure
...and 22 more sections

Figures (3)

Figure 1: Minitap uses six specialized agents with the Cortex as the central reasoning bottleneck. The system decomposes mobile automation into two phases: (1) Plan, where the Planner decomposes user goals into ordered subgoals and the Orchestrator tracks their completion status, and (2) Act, where the Contextor captures device state, the Cortex reasons over the state to select actions, and the Executor validates and dispatches commands to the platform-agnostic Device Controller. The Summarizer compresses action history to manage context length. The Convergence node routes control flow: continuing to the next iteration, triggering replanning upon subgoal failure, or terminating upon task completion. Icons indicate model capacity requirements: the brain denotes high-capacity reasoning models; the lightning bolt denotes distillable components amenable to smaller, faster models.
Figure 2: The Cortex is the critical reasoning bottleneck; other agents tolerate smaller models. Task success rate on a 50-task AndroidWorld subset stratified by app when individual agents are downgraded from Gemini 2.5 Pro to Qwen 8B-VL. Downgrading the Planner, Contextor, or Orchestrator reduces performance to 51–58%, while downgrading the Cortex collapses success to 11%. Dashed line indicates 50% threshold.
Figure 3: Intelligent model allocation (Platform Default) achieves frontier performance at 32% lower cost. Cost-performance Pareto frontier across nine agent configurations on AndroidWorld. Gold halos indicate Pareto-optimal configurations. Platform Default (1) matches All Frontier's 100% success rate while reducing cost from $1.58 to $1.07 per task by using a frontier model only for the reasoning-critical Cortex agent and budget models elsewhere. The dashed line connects Pareto-optimal points. Numbers correspond to: (1) Platform Default, (2) All Frontier, (3) Degrade Planner, (4) Degrade Contextor, (5) Frontier Cortex Only, (6) Degrade Orchestrator, (7) GPT-4o Cortex, (8) Flash Cortex, (9) Degrade Cortex.
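The cost comparison in the Figure 3 caption can be checked with a short sketch. Only the two configurations whose cost and success rate are stated in the caption are included; the other seven are omitted because their numbers are not given here. The function names `pareto_optimal` and `relative_saving` are illustrative assumptions, not part of the paper.

```python
from typing import Dict, List, Tuple

# (cost per task in USD, success rate in %) as stated in the Figure 3 caption.
configs: Dict[str, Tuple[float, float]] = {
    "Platform Default": (1.07, 100.0),
    "All Frontier": (1.58, 100.0),
}


def pareto_optimal(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of points not dominated on (lower cost, higher success)."""
    optimal = []
    for name, (cost, success) in points.items():
        dominated = any(
            c <= cost and s >= success and (c < cost or s > success)
            for other, (c, s) in points.items()
            if other != name
        )
        if not dominated:
            optimal.append(name)
    return optimal


def relative_saving(baseline: float, cheaper: float) -> float:
    """Fractional cost reduction of `cheaper` relative to `baseline`."""
    return (baseline - cheaper) / baseline


frontier = pareto_optimal(configs)
saving_pct = round(100 * relative_saving(1.58, 1.07))
```

Between these two points, Platform Default dominates All Frontier (same success rate, lower cost), and the saving works out to (1.58 - 1.07) / 1.58, roughly 32%, which agrees with the caption.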
