GPA: Learning GUI Process Automation from Demonstrations

Zirui Zhao, Jun Hao Liew, Yan Yang, Wenzhuo Yang, Ziyang Luo, Doyen Sahoo, Silvio Savarese, Junnan Li

Abstract

GUI Process Automation (GPA) is a lightweight but general vision-based approach to Robotic Process Automation (RPA) that enables fast and stable process replay from a single demonstration. To address the fragility of traditional RPA and the non-determinism of current vision-language-model-based GUI agents, GPA offers three core benefits: (1) robustness, via Sequential Monte Carlo (SMC)-based localization that handles rescaling and detection uncertainty; (2) determinism and reliability, safeguarded by readiness calibration; and (3) privacy, through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. GPA can also be used as an MCP/CLI tool by other agents with coding capabilities, so that the agent only reasons and orchestrates while GPA handles GUI execution. In a pilot experiment comparing GPA with Gemini 3 Pro (with CUA tools), GPA achieved a higher success rate and 10 times faster execution on long-horizon GUI tasks.

Paper Structure

This paper contains 39 sections, 15 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: The two phases of GPA: the demonstration phase (a) and the execution phase (b). In the demonstration phase, a user interacts with the local desktop environment while the recorder captures screenshots and actions, parses them into step subgraphs, and applies LLM-based post-processing for cleanup and variable extraction. In the execution phase, GPA receives the latest observation from the local computer, loads the current recorded step, parses the screenshot into a UI graph, and combines the parsed graph, the recorded workflow graph, and the action in SMC localization. A finite state machine orchestrates all the components: it checks the readiness of each step, controls execution, and handles errors.
  • Figure 2: Example UI graph with detected elements and edges. The highlighted node is the target, and nearby nodes help localize it.
  • Figure 3: SMC likelihood visualization. The target to click is the checkbox next to "Edit". (a) shows the likelihood of directly matching the target checkbox. The pointwise maximum over candidates forms the likelihood surface, which is highly ambiguous because the matching candidates are all identical. (b) shows the likelihood of matching the target's neighbor nodes, which helps distinguish the correct checkbox. (c) shows the joint likelihood, combining the results of (a) and (b).
  • Figure 4: Mixture prior over scale $s$: one mode favors no resizing ($s=1$), and the other favors proportional rescaling.
  • Figure 5: Process of SMC. The blue points are particles that gradually converge to the target prediction.
  • ...and 2 more figures
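
The captions for Figures 3 to 5 describe three ingredients: a target likelihood formed as a pointwise maximum over identical-looking candidates, a neighbor likelihood that breaks the tie, and a two-mode prior over the scale $s$, all driven by particles that converge over several iterations. The following is a minimal Python sketch of how these pieces could fit together. It is not the authors' implementation: the Gaussian match kernel, the product form of the joint likelihood, the annealing schedule, and every name and parameter (`sample_scale`, `max_match`, `joint_likelihood`, `smc_localize`, `p_same`, `sigma`, and so on) are assumptions made for illustration.

```python
import math
import random


def gaussian(d2, sigma):
    """Unnormalized isotropic Gaussian kernel on a squared distance."""
    return math.exp(-d2 / (2.0 * sigma ** 2))


def sample_scale(rng, ratio, p_same=0.5, sd=0.05):
    """Two-mode prior over scale s: one mode at s=1 (no resizing),
    one at the proportional resize ratio. Mixture weights are assumed."""
    mode = 1.0 if rng.random() < p_same else ratio
    return max(0.1, rng.gauss(mode, sd))


def max_match(point, candidates, sigma):
    """Pointwise maximum of the match kernel over all candidates."""
    px, py = point
    return max(gaussian((px - cx) ** 2 + (py - cy) ** 2, sigma)
               for cx, cy in candidates)


def joint_likelihood(particle, targets, neighbors, offset, sigma):
    """Target match times neighbor match (product form is an assumption).
    The recorded target-to-neighbor offset is stretched by the scale s."""
    x, y, s = particle
    l_target = max_match((x, y), targets, sigma)
    l_neighbor = max_match((x + s * offset[0], y + s * offset[1]),
                           neighbors, sigma)
    return l_target * l_neighbor


def smc_localize(targets, neighbors, offset, ratio, bounds,
                 n=500, steps=8, seed=0):
    """Weight, resample, and jitter particles (x, y, s); return their mean."""
    rng = random.Random(seed)
    width, height = bounds
    particles = [(rng.uniform(0, width), rng.uniform(0, height),
                  sample_scale(rng, ratio)) for _ in range(n)]
    sigma = 40.0  # wide kernel at first, tightened every iteration
    for _ in range(steps):
        # The tiny constant keeps the total weight strictly positive.
        weights = [joint_likelihood(p, targets, neighbors, offset, sigma)
                   + 1e-300 for p in particles]
        particles = rng.choices(particles, weights=weights, k=n)
        particles = [(x + rng.gauss(0, sigma / 4),
                      y + rng.gauss(0, sigma / 4),
                      max(0.1, s + rng.gauss(0, 0.01)))
                     for x, y, s in particles]
        sigma = max(4.0, sigma * 0.7)
    mean_x = sum(p[0] for p in particles) / n
    mean_y = sum(p[1] for p in particles) / n
    mean_s = sum(p[2] for p in particles) / n
    return mean_x, mean_y, mean_s
```

As a hypothetical usage, suppose a window was recorded with the "Edit" label 50 pixels to the right of its checkbox and is now displayed at 1.5 times the size, with three identical checkboxes at `(150, 150)`, `(150, 300)`, and `(150, 450)` and the "Edit" label at `(225, 300)`. Calling `smc_localize` with those checkboxes as `targets`, the label as the only entry in `neighbors`, `offset=(50, 0)`, and `ratio=1.5` should concentrate the particles on the middle checkbox with a scale near 1.5, because only that combination matches the target and its neighbor at the same time.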