BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism
Qinzhuo Wu, Pengzhi Gao, Wei Liu, Jian Luan
TL;DR
BacktrackAgent tackles the problem of error recovery in GUI agents by introducing a backtracking framework that jointly uses a rule-based Verifier and a model-based Judger to detect errors and a Reflector to recover from them. The approach trains Generator, Judger, and Reflector with cross-entropy losses and two auxiliary reward signals from the error detectors, incorporating both actual and simulated action executions to teach robust recovery. Evaluations on Mobile3M and Auto-UI show notable gains in task success rate and both task- and step-level accuracies, outperforming strong baselines like ReachAgent and MobileVLM, and ablations demonstrate the value of each component and the superiority of using actual execution outcomes. The work contributes a principled backtracking mechanism, specialized judgment and reflection datasets, and thorough empirical analysis, offering a practical path to more reliable GUI automation systems.
Abstract
Graphical User Interface (GUI) agents have gained substantial attention due to their impressive ability to complete tasks through multiple interactions within GUI environments. However, existing agents primarily focus on enhancing the accuracy of individual actions and often lack effective mechanisms for detecting and recovering from errors. To address these shortcomings, we propose BacktrackAgent, a robust framework that incorporates a backtracking mechanism to improve task completion efficiency. BacktrackAgent includes verifier, judger, and reflector modules for error detection and recovery, and applies judgment rewards to further enhance the agent's performance. Additionally, we develop a training dataset specifically designed for the backtracking mechanism, which considers the outcome pages after action executions. Experimental results show that BacktrackAgent improves both task success rate and step accuracy on the Mobile3M and Auto-UI benchmarks. Our data and code will be released upon acceptance.
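To make the control flow concrete, the following is a minimal sketch of one generate, check, and reflect step. It is not the authors' implementation: the function `backtrack_step`, the callable signatures, and the toy stubs (string pages, a "page changed" rule, a keyword-based judge) are all assumptions introduced for illustration. The only parts taken from the text above are the roles of the components: a rule-based verifier and a model-based judger inspect the outcome page, and a reflector proposes a replacement action when either check fails.

```python
from typing import Callable, Tuple


def backtrack_step(
    page: str,
    task: str,
    generate: Callable[[str, str], str],
    execute: Callable[[str, str], str],
    verify: Callable[[str, str, str], bool],
    judge: Callable[[str, str, str, str], bool],
    reflect: Callable[[str, str, str, str], str],
) -> Tuple[str, str, bool]:
    """Run one step; return (final_action, outcome_page, backtracked)."""
    action = generate(page, task)
    outcome = execute(page, action)
    # Cheap rule-based check first, then the model-based check.
    if verify(page, action, outcome) and judge(page, action, outcome, task):
        return action, outcome, False
    # Error detected: return to the original page and try a revised action.
    revised = reflect(page, task, action, outcome)
    return revised, execute(page, revised), True


# --- Toy stand-ins (hypothetical, for demonstration only) ---
def toy_generate(page: str, task: str) -> str:
    return "click(back)"  # deliberately unhelpful first action


def toy_execute(page: str, action: str) -> str:
    return f"{page}>{action}"  # the "outcome page" is just a trace string


def toy_verify(page: str, action: str, outcome: str) -> bool:
    return outcome != page  # rule: the action must change the page


def toy_judge(page: str, action: str, outcome: str, task: str) -> bool:
    return "search" in action  # stand-in for a learned judgment


def toy_reflect(page: str, task: str, action: str, outcome: str) -> str:
    return "click(search)"  # stand-in for a learned reflection


final_action, final_page, backtracked = backtrack_step(
    "home", "find shoes",
    toy_generate, toy_execute, toy_verify, toy_judge, toy_reflect,
)
```

In this toy run the first action passes the rule-based check but fails the judge, so the step backtracks and executes the reflector's action from the original page. A real system would replace each stub with a model call or a device interaction, and might allow more than one reflection attempt per step.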
