Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

Zehao Deng, Tianjie Ju, Zheng Wu, Zhuosheng Zhang, Gongshen Liu

TL;DR

The paper addresses the difficulty of solving long-horizon GUI automation with single, monolithic agents that struggle with planning and state tracking. It proposes the Coordinator-Executor-State Tracker (CES) framework, separating high-level scheduling from low-level execution and introducing a dynamic language-based state tracker to maintain task coherence. A staged execution-feedback reinforcement learning strategy trains the high-level Coordinator and State Tracker while keeping the Executor frozen, using an Execution-Feedback Reward to align learning with verifiable outcomes. Empirical results on multiple long-horizon GUI benchmarks show substantial gains in planning accuracy, state awareness, and generalization across executors, validating CES as a robust plug-and-play enhancement for GUI agents.

Abstract

The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities with low-level execution capability, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Rather than training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. On this basis, we build the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model and assists the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is generalizable and plug-and-play, significantly enhancing the long-horizon capabilities of various Executors. Code is available at https://github.com/hehehahi4/CES.

Paper Structure

This paper contains 47 sections, 25 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: (a) Difference between simple tasks and long-horizon tasks. A simple task involves only a single action on one screen, driven by an atomic instruction, whereas a long-horizon task requires a complex trajectory driven by an ambiguous, high-level user instruction. (b) Comparison of how existing methods and our method address long-horizon challenges. Left (Responsibility Coupling and Capability Conflict): A single agent is overloaded by coupling high-level capability and low-level execution. Our method resolves this by decoupling these roles into high-level and low-level components. Right (Lack of Task State Awareness): A single agent loses context on ambiguous screens like the Home screen. Our State Tracker provides high-semantic memory, enabling correct, context-aware decisions.
  • Figure 2: Temporal Judgement Accuracy. While accuracy is high for adjacent steps, it drops dramatically as the step interval increases. This result empirically demonstrates that screenshots fail to represent task state sufficiently and we need a mechanism to record progress for long-horizon tasks.
  • Figure 3: The CES multi-agent loop framework. CES executes complex long-horizon tasks through the collaboration of three specialized agents. The Coordinator, as the task scheduling and decision-making core, combines the user's high-level instruction and the current task state (provided by the State Tracker) to decompose the task into a clear atomic instruction. The Executor, acting as the tool, precisely executes this atomic instruction and interacts with the GUI environment. Finally, the State Tracker, as the memory, observes the Executor's output and updates it into a high-semantic task state summary, which is then fed back to the Coordinator for the next step of decision-making.
  • Figure 4: Our proposed staged execution-feedback RL strategy. This strategy utilizes the Execution-Feedback Reward from a fixed Executor to sequentially optimize the Coordinator (Stage 1) and State Tracker (Stage 2) in two independent training phases.
  • Figure 5: Failure Case Analysis. Compared to the baseline, our CES framework almost completely eliminates cognitive errors like State Loss and Planning Error.
  • ...and 7 more figures
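
The Coordinator, Executor, and State Tracker loop described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `run_ces_loop`, the callable signatures, the `"DONE"` stop token, the initial state string, the `max_steps` cap, and the choice to pass the current observation to both the Coordinator and the Executor are all hypothetical. Each agent is modeled as a plain callable so that any Executor can be swapped in, which mirrors the plug-and-play claim in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepRecord:
    """One iteration of the loop (hypothetical record type)."""
    instruction: str  # atomic instruction issued by the Coordinator
    action: str       # output of the Executor for that instruction
    state: str        # task-state summary after the State Tracker update


def run_ces_loop(
    user_instruction: str,
    coordinator: Callable[[str, str, str], str],
    executor: Callable[[str, str], str],
    state_tracker: Callable[[str, str, str], str],
    observe: Callable[[], str],
    max_steps: int = 20,       # assumed safety cap, not from the paper
    done_token: str = "DONE",  # assumed stop signal, not from the paper
) -> List[StepRecord]:
    state = "No progress yet."  # assumed initial task-state summary
    history: List[StepRecord] = []
    for _ in range(max_steps):
        observation = observe()
        # Coordinator: high-level instruction + current task state -> atomic instruction.
        instruction = coordinator(user_instruction, state, observation)
        if instruction == done_token:
            break
        # Executor: carries out the atomic instruction against the GUI.
        action = executor(instruction, observation)
        # State Tracker: folds the Executor's output into the state summary,
        # which the Coordinator reads on the next iteration.
        state = state_tracker(user_instruction, state, action)
        history.append(StepRecord(instruction, action, state))
    return history
```

Because the three roles only communicate through text (instruction, action, state summary), replacing the Executor means passing a different `executor` callable without touching the other two.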