GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Rui Yang; Qianhui Wu; Zhaoyang Wang; Hanyang Chen; Ke Yang; Hao Cheng; Huaxiu Yao; Baoling Peng; Huan Zhang; Jianfeng Gao; Tong Zhang

TL;DR

This work presents GUI-Libra, a tailored post-training recipe for native GUI agents that consistently improves both step-wise accuracy and end-to-end task completion, and that introduces success-adaptive scaling to downweight unreliable negative gradients during RL.

Abstract

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
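To make the token reweighting idea in action-aware SFT concrete, the following is a minimal sketch of a weighted token-level negative log-likelihood. It is an illustration under stated assumptions, not the authors' implementation: the segment names, the weight values in `SEGMENT_WEIGHTS`, and the function `reweighted_nll` are all hypothetical, and the abstract does not specify the actual weighting scheme.

```python
from typing import Mapping, Sequence

# Hypothetical per-segment weights (not taken from the paper): tokens in the
# action and grounding parts of the output count more than reasoning tokens.
SEGMENT_WEIGHTS = {"reasoning": 1.0, "action": 2.0, "grounding": 4.0}


def reweighted_nll(
    token_logprobs: Sequence[float],
    segments: Sequence[str],
    weights: Mapping[str, float] = SEGMENT_WEIGHTS,
) -> float:
    """Weighted average negative log-likelihood over the output tokens.

    token_logprobs[i] is the model's log-probability of the i-th target token;
    segments[i] names the part of the output that token belongs to.
    """
    if len(token_logprobs) != len(segments):
        raise ValueError("token_logprobs and segments must have equal length")
    if not token_logprobs:
        raise ValueError("at least one token is required")

    weighted_loss = 0.0
    total_weight = 0.0
    for logprob, segment in zip(token_logprobs, segments):
        w = weights[segment]
        weighted_loss += -w * logprob  # per-token NLL scaled by its weight
        total_weight += w
    return weighted_loss / total_weight


# A direct-action sample simply has no "reasoning" tokens, so the same loss
# covers both reasoning-then-action and direct-action data.
mixed = reweighted_nll([-1.0, -1.0, -2.0], ["reasoning", "action", "grounding"])
direct = reweighted_nll([-1.0, -2.0], ["action", "grounding"])
```

Normalizing by the total weight keeps the loss scale comparable between samples with long reasoning traces and samples with none, which is one plausible design choice among several.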

Paper Structure (84 sections, 5 theorems, 56 equations, 17 figures, 19 tables)

Introduction
Related Work
Datasets for Training GUI Agents
VLM Post-training for GUI Agents
Preliminaries
VLM-based GUI agents.
High-level vs. low-level GUI tasks.
Post-training for GUI Models.
Reasoning Data Curation for GUI Agents
Data Curation and Filtering Pipeline
Data Sources
Unified Structured Format
Action-aligned Reasoning Augmentation
Data Filtering for SFT
SFT Dataset Statistics.
...and 69 more sections

Key Result

theorem 1

Offline-to-online bound under partial verifiability. Assume Assumption (ass:failure) holds and that, for all $t\in[H]$, $d_{\pi,t}(s)>0$ implies $d_\mu(s)>0$ (i.e., $\mathrm{supp}(d_{\pi,t})\subseteq \mathrm{supp}(d_\mu)$). This condition ensures the occupancy ratio $C(\pi)$ is well-defined. Then [...] In particular, if $C(\pi)$ is uniformly bounded over a policy class and $\bar{\eta}_\pi$ is small o

Figures (17)

Figure 1: Overview of GUI-Libra. Using only a subset of existing open-source GUI trajectories, we tackle key limitations of prior training pipelines through action-aligned reasoning data curation, action-aware SFT, and conservative RL, yielding consistent gains on online benchmarks.
Figure 2: Example data format in GUI-Libra-81K. Each sample includes the current visual observation (screenshot) and textual context (system prompt, user instruction, and interaction history/previous actions). The model output is split into (1) a CoT reasoning trace and (2) a structured executable action (JSON), specifying the action type, a brief action description, the target element (if available), and action arguments such as text values or coordinates.
Figure 3: (a)(b) Data source distribution for SFT and RL. (c) Action type distribution of GUI-Libra-81K. (d) Comparison of step index distributions between our SFT and RL datasets.
Figure 4: (a) Grounding accuracy on ScreenSpot-v2 versus response length for base models and CoT-SFT models, showing that overly long responses correlate with degraded grounding. (b) Average grounding accuracy under different SFT strategies, where excessively long reasoning traces lead to a substantial drop.
Figure 5: Overall training framework of GUI-Libra: Stage 1 applies action-aware SFT with mixed supervision and token reweighting; Stage 2 performs KL-regularized GRPO with success-adaptive negative gradient scaling.
...and 12 more figures
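The Stage 2 idea named in the Figure 5 caption, success-adaptive negative gradient scaling on top of GRPO, can be sketched as follows. This is a hedged illustration only: the captions do not give the scaling rule, so the choice to multiply negative advantages by the group's success rate is an assumption made here, and the helper names (`group_advantages`, `success_adaptive_advantages`, `success_threshold`) are hypothetical. The KL regularization term is omitted from the sketch.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: rewards standardized within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def success_adaptive_advantages(
    rewards: Sequence[float],
    success_threshold: float = 1.0,
    eps: float = 1e-6,
) -> List[float]:
    """Shrink negative advantages when few rollouts in the group succeed.

    ASSUMPTION (not from the paper): the scale is the fraction of rollouts
    whose reward reaches success_threshold. Under partial verifiability, a
    rollout that mismatches the single demonstrated action may still be
    correct, so its negative signal is treated as less reliable.
    """
    advantages = group_advantages(rewards, eps)
    success_rate = sum(r >= success_threshold for r in rewards) / len(rewards)
    # Positive advantages pass through unchanged; negatives are downweighted.
    return [a if a >= 0 else a * success_rate for a in advantages]


rewards = [1.0, 0.0, 0.0, 0.0]  # one verified success in a group of four
plain = group_advantages(rewards)
scaled = success_adaptive_advantages(rewards)
```

In this example only one of four rollouts matches the demonstrated action, so the three negative advantages are multiplied by 0.25 while the positive one is left as is.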

Theorems & Definitions (9)

definition 1
theorem 1
corollary 1
theorem 2
proof
lemma 1
proof
lemma 2
proof
