OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models

Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Qiushi Sun, Zhaoyang Liu, Zhoumianze Liu, Yu Qiao, Xiangyu Yue, Zun Wang, Zichen Ding

TL;DR

OS-Oracle introduces a cross-platform GUI critic framework that closes data and benchmarking gaps for step-level evaluation in computer-using agents. It combines a scalable data pipeline (generating ~310k samples), a two-stage training regime (SFT + CP-GRPO) that aligns the critic's reasoning with its decisions, and OS-Critic Bench for evaluating cross-platform performance. The resulting OS-Oracle-7B achieves state-of-the-art results among open-source VLMs and improves native GUI agents when used as a pre-critic, demonstrating practical gains in long-horizon GUI tasks. With its code released as open source, OS-Oracle provides a complete pathway for building robust, cross-platform GUI critics and improving real-world task success rates.

Abstract

With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can cause unintended consequences, motivating critic models that assess each action before execution. While critic models offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and of public critic benchmarks for step-level evaluation in computer use. To bridge these gaps, we introduce OS-Oracle, which makes three core contributions: (1) a scalable data pipeline for synthesizing cross-platform GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); (3) OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms. Leveraging this framework, we curate a high-quality dataset containing 310k critic samples. The resulting critic model, OS-Oracle-7B, achieves state-of-the-art performance among open-source VLMs on OS-Critic Bench and surpasses proprietary models in the mobile domain. Furthermore, when serving as a pre-critic, OS-Oracle-7B improves the performance of native GUI agents such as UI-TARS-1.5-7B in the OSWorld and AndroidWorld environments. The code is open-sourced at https://github.com/numbmelon/OS-Oracle.
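To make the pre-critic idea concrete, the following is a minimal sketch of how a critic could gate an agent's proposed action before it is executed. It is not the paper's implementation: the `Step` type, the `propose` and `critic` callables, the `rejected` list passed back to the agent, and the `max_retries` fallback policy are all hypothetical names and choices made for illustration. Only the critic's inputs (task, interaction history, proposed action, current screenshot) follow the description in the source.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One proposed GUI action, e.g. 'click(search_box)' (hypothetical type)."""
    action: str


# Hypothetical interfaces, assumed for this sketch only:
#   propose(task, history, screenshot, rejected) -> Step
#   critic(task, history, step, screenshot) -> bool   (True = judged correct)
Proposer = Callable[[str, List[str], str, List[str]], Step]
Critic = Callable[[str, List[str], Step, str], bool]


def choose_action(task: str, history: List[str], screenshot: str,
                  propose: Proposer, critic: Critic,
                  max_retries: int = 2) -> Step:
    """Return the first proposal the critic accepts.

    If every proposal is rejected, fall back to the last one (an assumed
    policy; a real system might stop or ask for help instead).
    """
    rejected: List[str] = []
    step = propose(task, history, screenshot, rejected)
    for attempt in range(max_retries + 1):
        if critic(task, history, step, screenshot):
            return step  # accepted before execution
        rejected.append(step.action)
        if attempt < max_retries:
            # Ask the agent again, telling it what was already rejected.
            step = propose(task, history, screenshot, rejected)
    return step


# Toy stand-ins for the agent and the critic model.
def toy_propose(task, history, screenshot, rejected):
    candidates = ["click(delete_all)", "click(search_box)"]
    for candidate in candidates:
        if candidate not in rejected:
            return Step(candidate)
    return Step(candidates[-1])


def toy_critic(task, history, step, screenshot):
    # Reject the irreversible action; accept everything else.
    return "delete" not in step.action


chosen = choose_action("search for a file", [], "<screenshot>",
                       toy_propose, toy_critic)
print(chosen.action)  # click(search_box)
```

In this toy run the first proposal is rejected and the second is accepted, so the irreversible action is never executed. Gating before execution, rather than checking afterwards, is what lets a critic prevent actions that cannot be undone.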

Paper Structure

This paper contains 21 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: An example of CUA automation with OS-Oracle-7B. The critic model analyzes the current task, interaction history, the CUA's proposed next action, and the current screenshot, and then judges whether the action is correct.
  • Figure 2: The framework of OS-Oracle. (Left) Data synthesis: we first use GPT-based filtering to extract step-wise positive samples from raw trajectories, and then construct four types of negative samples from the positive ones: Operation Failure (OF), Inefficient Error State Recovery (IESR), Mistimed Task Termination (MTT), and Inaccurate Element Localization (IUEL). (Right) Two-stage training: supervised fine-tuning (SFT) on synthesized data followed by CP-GRPO training with accuracy, format, and consistency-preserving rewards.
  • Figure 3: The performance of UI-TARS-1.5-7B with critic model on AndroidWorld and OSWorld.
  • Figure 4: Effect of SFT data scaling on overall performance.
  • Figure 5: The action distribution of OS-Critic Bench.
  • ...and 3 more figures