Do Multi-Agents Dream of Electric Screens? Achieving Perfect Accuracy on AndroidWorld Through Task Decomposition
Pierre-Louis Favreau, Jean-Pierre Lo, Clement Guiguet, Charles Simon-Meunier, Nicolas Dehandschoewercker, Allen G. Roush, Judah Goldfeder, Ravid Shwartz-Ziv
TL;DR
The paper addresses the challenge of fully automating mobile UI tasks on the AndroidWorld benchmark and introduces Minitap, a six-agent system (Planner, Orchestrator, Contextor, Cortex, Executor, Screen Analyzer) plus a Summarizer, augmented by verified execution and meta-cognitive reasoning. By separating cognition across specialized agents and enforcing deterministic post-validation of text input, Minitap addresses three failure modes of single-agent systems (context pollution, silent text input failures, and repetitive action loops), achieving $100\%$ on $116$ tasks across $20$ apps and surpassing the human baseline of $80\%$ by $20$ points. Ablation studies quantify the contribution of each component: multi-agent decomposition contributes $+21$ points, verified execution $+7$ points, and meta-cognitive reasoning $+9$ points, indicating that all three are needed for peak performance. The work also analyzes cost-efficiency through a Pareto frontier, showing that reserving frontier models for selected agents (e.g., Cortex) yields the same performance at substantially lower cost, and it discusses broader implications for modular, verified agent design in mobile automation. Minitap is released as open source for reproducibility and further research.
Abstract
We present Minitap, a multi-agent system that achieves 100% success on the AndroidWorld benchmark, making it the first to fully solve all 116 tasks and surpassing human performance (80%). We first analyze why single-agent architectures fail: context pollution from mixed reasoning traces, silent text input failures that go undetected by the agent, and repetitive action loops with no escape. Minitap addresses each failure through a targeted mechanism: cognitive separation across six specialized agents, deterministic post-validation of text input against device state, and meta-cognitive reasoning that detects cycles and triggers strategy changes. Ablations show that multi-agent decomposition contributes +21 points over single-agent baselines, verified execution adds +7 points, and meta-cognition adds +9 points. We release Minitap as open-source software: https://github.com/minitap-ai/mobile-use
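To make two of these mechanisms concrete, the following is a minimal Python sketch of post-validated text input and cycle detection. It is not taken from the Minitap codebase. Every name here (`type_text_verified`, `CycleDetector`, `FlakyField`) is hypothetical, as are the callbacks `send_text` and `read_field`, the retry count, and the window and repeat thresholds. The sketch also assumes that `send_text` replaces the field's content rather than appending to it.

```python
from collections import deque
from typing import Callable, Deque, Hashable, Tuple


def type_text_verified(
    send_text: Callable[[str], None],
    read_field: Callable[[], str],
    text: str,
    max_attempts: int = 3,
) -> bool:
    """Type `text`, then confirm it by reading the field back.

    `send_text` and `read_field` are hypothetical callbacks standing in
    for a device interface; they are assumptions, not Minitap's API.
    """
    for _ in range(max_attempts):
        send_text(text)
        if read_field() == text:  # deterministic check against device state
            return True
    return False  # the caller learns about the failure instead of missing it


class CycleDetector:
    """Flags a loop when one (screen, action) pair recurs within a window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        self.history: Deque[Tuple[Hashable, Hashable]] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, screen_id: Hashable, action: Hashable) -> bool:
        """Record a step; return True if a strategy change is warranted."""
        step = (screen_id, action)
        self.history.append(step)
        return self.history.count(step) >= self.max_repeats


class FlakyField:
    """Fake text field that silently drops its first input (illustration)."""

    def __init__(self) -> None:
        self.value = ""
        self.calls = 0

    def send(self, text: str) -> None:
        self.calls += 1
        if self.calls > 1:  # the first call is lost without any error
            self.value = text

    def read(self) -> str:
        return self.value
```

With `field = FlakyField()`, calling `type_text_verified(field.send, field.read, "hello")` notices the dropped first input on read-back and retries. A `CycleDetector` fed the same screen and action repeatedly signals once the repeat threshold is reached, at which point a controller could switch strategy.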
