MobileUse: A GUI Agent with Hierarchical Reflection for Autonomous Mobile Operation

Ning Li, Xiangmou Qu, Jiamu Zhou, Jun Wang, Muning Wen, Kounianhua Du, Xingyu Lou, Qiuying Peng, Jun Wang, Weinan Zhang

TL;DR

MobileUse targets robust autonomous mobile operation by combining a hierarchical reflection architecture with a proactive exploration module. Three Reflectors (Action, Trajectory, Global) let the agent monitor and correct itself at the level of a single action, the recent trajectory, and the overall task, while a Reflection-on-Demand strategy balances accuracy against efficiency. A proactive exploration stage builds reusable environmental knowledge to mitigate cold-start issues. MobileUse reports state-of-the-art success rates of 62.9% on AndroidWorld and 44.2% on AndroidLab, and the authors release an out-of-the-box MobileUse Toolkit for real-world device automation.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled the development of mobile agents that can understand visual inputs and follow user instructions, unlocking new possibilities for automating complex tasks on mobile devices. However, applying these models to real-world mobile scenarios remains a significant challenge due to long-horizon task execution, difficulty in error recovery, and the cold-start problem in unfamiliar environments. To address these challenges, we propose MobileUse, a GUI agent designed for robust and adaptive mobile task execution. To improve resilience in long-horizon tasks and dynamic environments, we introduce a hierarchical reflection architecture that enables the agent to self-monitor, detect, and recover from errors across multiple temporal scales, from individual actions to overall task completion, while maintaining efficiency through a reflection-on-demand strategy. To tackle cold-start issues, we further introduce a proactive exploration module, which enriches the agent's understanding of the environment through self-planned exploration. Evaluations on the AndroidWorld and AndroidLab benchmarks demonstrate that MobileUse establishes new state-of-the-art performance, achieving success rates of 62.9% and 44.2%, respectively. To facilitate real-world applications, we release an out-of-the-box toolkit for automated task execution on physical mobile devices, available at https://github.com/MadeAgents/mobile-use.
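The reflection-on-demand idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which quantity is thresholded, so the code below assumes a per-action confidence score in [0, 1]; `ProposedAction`, `should_reflect`, and `count_reflector_calls` are hypothetical names introduced for illustration and are not part of the released toolkit.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProposedAction:
    """Hypothetical container for one action proposed by the agent."""
    name: str
    confidence: float  # ASSUMPTION: a score in [0, 1]; the source does not define it


def should_reflect(action: ProposedAction, threshold: float) -> bool:
    """Return True when the (assumed) confidence falls below the threshold.

    A threshold of 0.0 never triggers reflection; a threshold of 1.0
    triggers it for every action whose confidence is below 1.0.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return action.confidence < threshold


def count_reflector_calls(actions: Iterable[ProposedAction], threshold: float) -> int:
    """Count how many actions would be sent to a reflector at this threshold."""
    return sum(should_reflect(a, threshold) for a in actions)
```

Under this assumption, raising the threshold trades extra reflector calls (cost) for more checks on uncertain actions (accuracy), which is the kind of trade-off a threshold sweep would expose.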

Paper Structure

This paper contains 27 sections, 6 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Overview of the MobileUse agent. In the Proactive Exploration stage, MobileUse familiarizes itself with a new environment by systematically exploring it and accumulating common knowledge. In the Autonomous Mobile Operation stage, given a user instruction, the Operator observes the screenshot at each step and outputs a specific action. MobileUse then performs hierarchical reflection at three levels for robust task execution. Finally, the Progressor summarizes and updates the current progress at the end of each step.
  • Figure 2: Overview of the hierarchical reflection architecture. The Action Reflector operates on a single step to provide immediate feedback. The Trajectory Reflector operates on the latest trajectory to ensure effective progress. The Global Reflector operates on the overall interaction history to validate the task completion.
  • Figure 3: Confusion matrix of task completion with and without hierarchical reflection on the AndroidWorld benchmark.
  • Figure 4: Error type analysis with and without hierarchical reflection on the AndroidWorld benchmark.
  • Figure 5: Performance w.r.t. different thresholds of the Reflection-on-Demand strategy on the AndroidWorld benchmark.
  • ...and 9 more figures
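
The per-step flow described in the captions of Figures 1 and 2 (Operator, three Reflectors with different scopes, then Progressor) can be sketched as follows. This is a minimal illustration, not the authors' implementation: every class and function name is hypothetical, the trajectory window size of 3 is an assumption, and the sketch calls all three reflectors on every step because the captions do not say when each one is triggered.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    screenshot: str  # placeholder for the observed screen
    action: str      # action chosen by the Operator


@dataclass
class AgentState:
    instruction: str
    history: List[Step] = field(default_factory=list)
    progress: str = ""


def action_scope(state: AgentState) -> List[Step]:
    """Action Reflector input: the single most recent step."""
    return state.history[-1:]


def trajectory_scope(state: AgentState, window: int) -> List[Step]:
    """Trajectory Reflector input: the latest `window` steps (window is assumed)."""
    return state.history[-window:]


def global_scope(state: AgentState) -> List[Step]:
    """Global Reflector input: the full interaction history."""
    return list(state.history)


def run_step(
    state: AgentState,
    screenshot: str,
    operator: Callable[[str, str, str], str],
    reflectors: Dict[str, Callable[[List[Step]], object]],
    progressor: Callable[[str, str, dict], str],
    window: int = 3,
) -> dict:
    """Run one iteration: act, reflect at three scopes, update progress."""
    action = operator(state.instruction, screenshot, state.progress)
    state.history.append(Step(screenshot, action))
    feedback = {
        "action": reflectors["action"](action_scope(state)),
        "trajectory": reflectors["trajectory"](trajectory_scope(state, window)),
        "global": reflectors["global"](global_scope(state)),
    }
    state.progress = progressor(state.progress, action, feedback)
    return feedback
```

In practice the `operator`, `reflectors`, and `progressor` callables would wrap model calls; here they are plain functions so the scoping logic can be exercised with stubs.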