BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism

Qinzhuo Wu, Pengzhi Gao, Wei Liu, Jian Luan

TL;DR

BacktrackAgent tackles error recovery in GUI agents with a backtracking framework: a rule-based Verifier and a model-based Judger detect errors, and a Reflector recovers from them. The Generator, Judger, and Reflector are trained with cross-entropy losses plus two auxiliary reward signals from the error detectors, using both actual and simulated action executions to teach robust recovery. Evaluations on Mobile3M and Auto-UI show gains in task success rate and in task- and step-level accuracy over strong baselines such as ReachAgent and MobileVLM. Ablations show that each component contributes and that training on actual execution outcomes works better than training on simulated ones. The work contributes a backtracking mechanism, dedicated judgment and reflection datasets, and an empirical analysis, offering a practical path to more reliable GUI automation systems.

Abstract

Graphical User Interface (GUI) agents have gained substantial attention for their ability to complete tasks through multiple interactions within GUI environments. However, existing agents primarily focus on enhancing the accuracy of individual actions and often lack effective mechanisms for detecting and recovering from errors. To address these shortcomings, we propose BacktrackAgent, a robust framework that incorporates a backtracking mechanism to improve task completion efficiency. BacktrackAgent includes verifier, judger, and reflector components as modules for error detection and recovery, and also applies judgment rewards to further enhance the agent's performance. Additionally, we develop a training dataset specifically designed for the backtracking mechanism, which considers the outcome pages after action executions. Experimental results show that BacktrackAgent improves both task success rate and step accuracy on the Mobile3M and Auto-UI benchmarks. Our data and code will be released upon acceptance.
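The detect-then-recover loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `BacktrackLoop`, its callable fields, the `max_retries` parameter, and the toy string-based pages and actions are all hypothetical names introduced here. In particular, the sketch assumes that a candidate action is accepted only when both the rule-based check and the model-based check pass, and that on failure the agent returns to the previous page and asks the reflector for a replacement action.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Page = str    # hypothetical stand-in for a GUI page (screenshot plus layout)
Action = str  # hypothetical stand-in for an action string such as "click(...)"


@dataclass
class BacktrackLoop:
    """Illustrative wiring of generator, verifier, judger, and reflector."""
    generate: Callable[[Page, str], Action]
    execute: Callable[[Page, Action], Page]
    verify: Callable[[Page, Action, Page], bool]      # rule-based check
    judge: Callable[[Page, Action, Page, str], bool]  # model-based check
    reflect: Callable[[Page, Action, Page, str], Action]
    max_retries: int = 1  # assumption: number of backtracks allowed per step

    def step(self, page: Page, task: str) -> Tuple[Action, Page]:
        action = self.generate(page, task)
        for _ in range(self.max_retries):
            outcome = self.execute(page, action)
            ok = self.verify(page, action, outcome) and self.judge(
                page, action, outcome, task
            )
            if ok:
                return action, outcome
            # Error detected: backtrack to `page` and regenerate the action.
            action = self.reflect(page, action, outcome, task)
        # Retries exhausted: commit to the last reflected action.
        return action, self.execute(page, action)


# Toy components, purely for demonstration.
def toy_execute(page: Page, action: Action) -> Page:
    target = action[len("click("):-1]
    return f"{page}>{target}"


loop = BacktrackLoop(
    generate=lambda page, task: "click(pickup)",           # deliberately wrong
    execute=toy_execute,
    verify=lambda page, action, outcome: outcome != page,   # page must change
    judge=lambda page, action, outcome, task: outcome.endswith("delivery"),
    reflect=lambda page, action, outcome, task: "click(delivery)",
)

action, outcome = loop.step("home", "order coffee for delivery")
print(action, outcome)
```

In the toy run, the first action leads to a page the judger rejects, so the loop discards that outcome, returns to the starting page, and executes the reflected action instead.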

Paper Structure

This paper contains 40 sections, 7 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Previous works often struggle to recover from errors, whereas BacktrackAgent utilizes a backtracking mechanism to recover from erroneous pages.
  • Figure 2: A ten-step GUI trajectory for ordering coffee. The red arrow indicates that the current page is identified as an error page, requiring a backtrack to the previous page in order to regenerate the necessary action. Action $a_1$ is an abbreviation for "click(delivery,[375,740][704,1032])". The detailed information is summarized in Figure \ref{figure2-2}.
  • Figure 3: The overview of BacktrackAgent. The left part shows the detailed process of an action $a^i_t$ being judged as an error by the error detection module and reflected by the error recovery module. The right part shows the pipeline of the agent generating GUI trajectories through action generation, error detection, and error recovery modules.
  • Figure 4: The action result pages generated by actual execution and simulated execution.
  • Figure 5: Box plot of the performance improvement (%) over ReachAgent across repeated experiments.
  • ...and 2 more figures
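
The action string quoted in the Figure 2 caption, "click(delivery,[375,740][704,1032])", suggests a format of action type, target element, and a bounding box given by two corner points. The sketch below parses that format and applies one simple rule-style check. It is an assumption-laden illustration: the names `GuiAction`, `parse_action`, and `box_is_valid`, the reading of the two bracketed pairs as top-left and bottom-right corners, and the screen dimensions are all hypothetical, and the check is not claimed to be one of the paper's Verifier rules.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Matches strings shaped like: kind(target,[x1,y1][x2,y2])
ACTION_RE = re.compile(r"^(\w+)\((.+),\[(\d+),(\d+)\]\[(\d+),(\d+)\]\)$")


@dataclass(frozen=True)
class GuiAction:
    kind: str                        # e.g. "click"
    target: str                      # e.g. "delivery"
    box: Tuple[int, int, int, int]   # assumed (x1, y1, x2, y2)


def parse_action(text: str) -> Optional[GuiAction]:
    """Parse an action string; return None if it does not fit the format."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        return None
    x1, y1, x2, y2 = (int(match.group(i)) for i in range(3, 7))
    return GuiAction(match.group(1), match.group(2), (x1, y1, x2, y2))


def box_is_valid(action: GuiAction, width: int, height: int) -> bool:
    """Hypothetical rule: the box is non-empty and lies inside the screen."""
    x1, y1, x2, y2 = action.box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


a1 = parse_action("click(delivery,[375,740][704,1032])")
print(a1)
print(box_is_valid(a1, width=1080, height=1920))  # screen size is assumed
```

A cheap structural check of this kind can reject malformed or off-screen actions before any model-based judgment is needed.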