Agent-SAMA: State-Aware Mobile Assistant

Linqiang Guo, Wei Liu, Yi Wen Heng, Tse-Hsun Chen, Yang Wang

TL;DR

Agent-SAMA is a state-aware mobile GUI agent framework that models app usage as a Finite State Machine (FSM), enabling structured planning, real-time execution, verification, and recovery across tasks. The system deploys four specialized agents (Planner, Screen Parser/State Agent/Actor, Reflection, and Mentor) to build per-app FSMs, validate progress, recover from errors, and accumulate long-term knowledge. On the cross-app benchmarks Mobile-Eval-E and SPA-Bench, Agent-SAMA improves task success by up to 12% and recovery success by 13.8% over prior methods, and it reaches a 63.7% success rate on AndroidWorld. The FSM acts as a lightweight, model-agnostic memory layer that improves task reliability and recoverability in complex mobile environments.

Abstract

Mobile Graphical User Interface (GUI) agents aim to autonomously complete tasks within or across apps based on user instructions. While recent Multimodal Large Language Models (MLLMs) enable these agents to interpret UI screens and perform actions, existing agents remain fundamentally reactive. They reason over the current UI screen but lack a structured representation of the app navigation flow, limiting GUI agents' ability to understand execution context, detect unexpected execution results, and recover from errors. We introduce Agent-SAMA, a state-aware multi-agent framework that models app execution as a Finite State Machine (FSM), treating UI screens as states and user actions as transitions. Agent-SAMA implements four specialized agents that collaboratively construct and use FSMs in real time to guide task planning, execution verification, and recovery. We evaluate Agent-SAMA on two types of benchmarks: cross-app (Mobile-Eval-E, SPA-Bench) and mostly single-app (AndroidWorld). On Mobile-Eval-E, Agent-SAMA achieves an 84.0% success rate and a 71.9% recovery rate. On SPA-Bench, it reaches an 80.0% success rate with a 66.7% recovery rate. Compared to prior methods, Agent-SAMA improves task success by up to 12% and recovery success by 13.8%. On AndroidWorld, Agent-SAMA achieves a 63.7% success rate, outperforming the baselines. Our results demonstrate that structured state modeling enhances robustness and can serve as a lightweight, model-agnostic memory layer for future GUI agents.
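
The FSM view described in the abstract (UI screens as states, user actions as transitions) can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `AppFSM`, `add_transition`, and `next_state`, and the example screen descriptions, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class AppFSM:
    """Hypothetical per-app FSM: UI screens are states, user actions are transitions."""

    # Maps (current state, action) -> resulting state.
    transitions: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add_transition(self, state: str, action: str, new_state: str) -> None:
        """Record that performing `action` on `state` led to `new_state`."""
        self.transitions[(state, action)] = new_state

    def next_state(self, state: str, action: str) -> Optional[str]:
        """Look up the state previously observed after `action`, if any."""
        return self.transitions.get((state, action))

    def states(self) -> Set[str]:
        """All states seen so far, as sources or targets of a transition."""
        seen = set(self.transitions.values())
        seen.update(source for source, _ in self.transitions)
        return seen


# Illustrative usage with made-up screen descriptions.
fsm = AppFSM()
fsm.add_transition("home screen", "tap search bar", "search input")
fsm.add_transition("search input", "type query and submit", "search results")
```

A lookup that returns `None` marks a transition the agent has not yet observed, which is the case where a planner would have to explore rather than rely on stored structure.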

Paper Structure

This paper contains 23 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: An example of how Agent-SAMA represents real-time UI interactions as a Finite State Machine (FSM). The left side shows the dynamic UI transitions of the Walmart App along with the user action (e.g., typing and tapping) that leads to a new UI screen. The right side shows the corresponding FSM, where each UI screen is represented as a state (a natural language description of the screen generated by MLLM) with its MLLM-generated pre- and post-condition. The user action defines the transition between the states. Given the current state and the entire FSM, Agent-SAMA also predicts the possible next state.
  • Figure 2: An overview of Agent-SAMA. The Planner, Actor, Screen Parser, State Agent, and Reflection Agent are involved in the main agent loop for each task, while Mentor contributes to updating long-term reusable knowledge across tasks. Decision-making at each step is disentangled into high-level planning by the Planner and low-level actions by the Actor. The State Agent builds FSMs dynamically, and the Reflection Agent verifies the outcome of each action, tracks progress, and provides error recovery.
  • Figure 3: Comparison between the previous state-of-the-art, Mobile-Agent-E (Wang et al., 2025), and Agent-SAMA.
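
The captions above describe states that carry pre- and post-conditions, a predicted next state, and a Reflection Agent that verifies each action's outcome and supports recovery. A rough sketch of that verification step follows, under loud assumptions: conditions are reduced to plain string sets (the paper describes MLLM-generated natural-language conditions), a state's post-conditions are treated as facts expected to hold once that screen is reached, and `State`, `verify_step`, and `reflect` are hypothetical names rather than the authors' API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class State:
    """Hypothetical FSM state: a screen description plus simplified conditions."""

    description: str
    preconditions: FrozenSet[str]
    postconditions: FrozenSet[str]


def verify_step(expected: State, observed_facts: Set[str]) -> Tuple[bool, Set[str]]:
    """Compare the predicted next state with what is observed on screen.

    Returns (ok, missing), where `missing` holds unmet post-conditions.
    """
    missing = set(expected.postconditions) - observed_facts
    return (not missing, missing)


def reflect(history: List[str], expected: State, observed_facts: Set[str]) -> str:
    """Accept the step if verified; otherwise suggest falling back to the last verified state."""
    ok, _missing = verify_step(expected, observed_facts)
    if ok:
        history.append(expected.description)
        return "continue"
    if history:
        return f"recover: return to '{history[-1]}'"
    return "recover: restart app"


# Illustrative usage with made-up screens and facts.
results = State(
    description="search results",
    preconditions=frozenset({"query submitted"}),
    postconditions=frozenset({"result list visible", "query shown"}),
)
history = ["search input"]
first = reflect(history, results, {"result list visible", "query shown"})
second = reflect(history, results, {"keyboard visible"})
```

In this sketch the first call verifies the step and extends the history, while the second detects unmet conditions and proposes returning to the last verified screen.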