STEVE: A Step Verification Pipeline for Computer-use Agent Training

Fanbin Lu, Zhisheng Zhong, Ziqin Wei, Shu Liu, Chi-Wing Fu, Jiaya Jia

TL;DR

STEVE tackles the data-hungry problem of training computer-use agents by introducing a step verification pipeline that uses GPT-4o to generate dense, stepwise rewards from trajectory data. By integrating a high-capacity UI-grounding vision-language model with Kahneman-Tversky Optimization, the approach leverages both positive and negative step signals to train a single agent capable of low-level UI perception and high-level planning. Empirical results demonstrate that a 7B STEVE-KTO agent outperforms supervised finetuning on multiple GUI benchmarks and achieves state-of-the-art performance in live Windows OS environments at reduced cost, underscoring scalability and practicality for desktop automation. Overall, STEVE provides a scalable blueprint for leveraging stepwise, model-based verification to train capable computer-use agents in real-world GUI tasks.

Abstract

Developing AI agents that autonomously manipulate graphical user interfaces is a long-standing challenge. Recent advances in data scaling laws inspire us to train computer-use agents with a scaled instruction set, yet training agents with behavior cloning still requires an immense number of high-quality trajectories. To meet this scalability need, we designed STEVE, a step verification pipeline for computer-use agent training. First, we establish a large instruction set for computer-use agents and collect trajectory data with some suboptimal agents. Next, GPT-4o verifies the correctness of each step in the trajectories based on the screens before and after the action is executed, assigning each step a binary label. Finally, we adopt Kahneman-Tversky Optimization (KTO) to optimize the agent from the binary stepwise labels. Extensive experiments show that our agent outperforms supervised finetuning by leveraging both positive and negative actions within a trajectory. STEVE also enables us to train a 7B vision-language model as a computer-use agent, achieving leading performance in the challenging live desktop environment WinAgentArena with great efficiency at a reduced cost. Code and data: https://github.com/FanbinLu/STEVE.
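To make the last stage of the pipeline concrete, the following is a minimal sketch of how binary step labels could feed a KTO-style objective. It follows the general form of the published KTO loss and is not taken from the STEVE codebase: the names `StepSample` and `kto_step_loss`, the default hyperparameters, and the fixed reference point `kl_ref` (which KTO normally estimates from a batch-level KL term) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class StepSample:
    """One verified step (hypothetical container, not from the paper's code)."""
    policy_logp: float  # log-probability of the action under the agent being trained
    ref_logp: float     # log-probability of the same action under a frozen reference model
    is_positive: bool   # binary label assigned by the step verifier


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_step_loss(samples, beta=0.1, lambda_pos=1.0, lambda_neg=1.0, kl_ref=0.0):
    """Average KTO-style loss over a batch of labeled steps.

    kl_ref stands in for the KL-based reference point of KTO; it is held
    constant here purely to keep the sketch self-contained (an assumption).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    total = 0.0
    for s in samples:
        # Implicit reward: log-ratio between the policy and the reference model.
        reward = s.policy_logp - s.ref_logp
        if s.is_positive:
            value = lambda_pos * sigmoid(beta * (reward - kl_ref))
            total += lambda_pos - value
        else:
            value = lambda_neg * sigmoid(beta * (kl_ref - reward))
            total += lambda_neg - value
    return total / len(samples)
```

Under this formulation, raising the likelihood of a step labeled positive lowers the loss, while raising the likelihood of a step labeled negative increases it. This is how both kinds of steps within a single trajectory can contribute a training signal, in contrast to supervised finetuning, which only imitates positive actions.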

Paper Structure

This paper contains 22 sections, 3 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Windows File Explorer task completion rate of different computer-use agents: (i) Our powerful GUI grounding model achieves the current best task completion rate, setting a promising upper bound for computer-use agent finetuning. (ii) Using STEVE, our step verification pipeline, we are able to train our agents with KTO (red), which consistently outperforms (iii) the supervised finetuning (SFT). Notably, with increased computer operating time (x-axis), our 7B KTO agent is able to outperform the OmniParser with the GPT-4o planner.
  • Figure 2: Datasets we collected for UI-grounding model training, including open-source datasets and an additional private Windows OS dataset created by ourselves to enhance the model's performance on Windows.
  • Figure 3: Overview of STEVE, the step verification pipeline. We first create a large number of feasible tasks from the seed tasks to scale up the quality and diversity of agent tasks. Then we deploy our computer-use agent in desktop environments to sample trajectory data. A GPT-4o judge is used to verify the quality of each step in the trajectory, resulting in a large process reward dataset for agent training.
  • Figure 4: Percentage consistency between human judges and the GPT-4o step verifier. We split all the positive and negative actions into early (step ID $\le 7$) and late (step ID $> 7$) groups, resulting in four bars in the figure. For example, $92.3\%$ for the Early Pos. bar means the GPT-4o judge agrees with humans for $92.3\%$ of the early positive actions.
  • Figure 5: An ablation study comparing OmniParser, the SFT agent, and the KTO agents from three iterative rounds (R1, R2, and R3). The results are evaluated on three distinct task domains from the WinAgentArena benchmark. Yellow bars in the figures indicate that GPT-4o is employed as the task planner. The reported outcomes represent the average performance over five experimental runs.
  • ...and 6 more figures
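
The grouping described in the Figure 4 caption can be sketched as a small agreement computation. This is an illustrative sketch only: the record layout, the function name `agreement_by_group`, and the choice to define a step as positive or negative by the human label are assumptions, since the caption does not specify which label defines the groups.

```python
from collections import defaultdict

EARLY_STEP_MAX = 7  # steps with ID <= 7 are "early", as stated in the Figure 4 caption


def agreement_by_group(records):
    """Percentage agreement between human and verifier labels per group.

    records: iterable of (step_id, human_label, verifier_label) tuples with
    boolean labels (hypothetical layout). Groups are keyed by the human
    label, which is an assumption made for this sketch.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for step_id, human_label, verifier_label in records:
        phase = "Early" if step_id <= EARLY_STEP_MAX else "Late"
        polarity = "Pos." if human_label else "Neg."
        group = f"{phase} {polarity}"
        total[group] += 1
        agree[group] += int(human_label == verifier_label)
    # Only groups that actually occur in the data are reported.
    return {group: 100.0 * agree[group] / total[group] for group in total}
```

For example, if two early steps are judged positive by humans and the verifier agrees on only one of them, the "Early Pos." entry is 50.0.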