VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

TL;DR

The paper tackles the cost and instability of environment-based RL for GUI agents and the generalization gaps of environment-free methods. It introduces the Value Environment Model (VEM), a two-stage offline framework that first learns a value function $Q_\theta(s,a)$ from human-guided annotations, then freezes it and uses it as a fixed value signal to guide policy optimization via PPO, avoiding environment interactions. On Android-in-the-Wild, the approach achieves state-of-the-art results among environment-free methods and matches environment-based performance while reducing interaction costs. The theoretical analysis shows near-optimality under coverage and approximation conditions, and experiments demonstrate improved training stability and robust generalization across GUI layouts. Overall, semantic, value-driven estimation enables efficient, layout-agnostic GUI task automation at scale.

Abstract

Training Vision-Language Models (VLMs) as Graphical User Interface (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., "Does this action advance the user's goal?"). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, significantly outperforming environment-free baselines and matching environment-based approaches without interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve performance comparable to online-trained methods.
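Stage (1) of the framework, fitting a value model on annotated offline data and then freezing it, can be illustrated with a deliberately small sketch. Everything here is an assumption for illustration: the state and action strings, the 0/1 label scale, and the tabular model (the per-pair mean, which is the least-squares fit for a table) stand in for the VLM-based VEM described in the abstract. A table cannot generalize to unseen screens the way a semantic model is meant to.

```python
from collections import defaultdict
from types import MappingProxyType
from typing import List, Mapping, Tuple

Key = Tuple[str, str]             # (state, action)
Sample = Tuple[str, str, float]   # (state, action, annotated value label)

# Hypothetical annotated offline data; the names and the 0/1 scale are illustrative only.
DATA: List[Sample] = [
    ("reviews_task/product_page", "click_share", 0.0),
    ("reviews_task/product_page", "click_reviews_tab", 1.0),
    ("calendar_task/home_screen", "click_calendar_icon", 1.0),
    ("calendar_task/home_screen", "click_camera_icon", 0.0),
    ("calendar_task/home_screen", "click_calendar_icon", 1.0),
]


def fit_value_model(data: List[Sample]) -> Mapping[Key, float]:
    """Supervised regression for a tabular model: the mean label per (state, action)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for state, action, label in data:
        sums[(state, action)] += label
        counts[(state, action)] += 1
    table = {key: sums[key] / counts[key] for key in sums}
    # Return a read-only view so later stages cannot update the value model.
    return MappingProxyType(table)


def best_action(q: Mapping[Key, float], state: str, candidates: List[str]) -> str:
    """Score candidate actions with the frozen model; unseen pairs default to 0.0."""
    return max(candidates, key=lambda action: q.get((state, action), 0.0))
```

Calling `fit_value_model(DATA)` and then `best_action(q, "reviews_task/product_page", ["click_share", "click_reviews_tab"])` prefers the reviews tab over the share button, without ever observing the next screen.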

Paper Structure

This paper contains 28 sections, 1 theorem, 5 equations, 9 figures, 4 tables.

Key Result

Theorem 3.1

Let $\widehat{\pi}$ maximize $\mathcal{J}(\pi) = \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi(\cdot\mid s)}[Q_\theta(s,a)]$. Under coverage and approximation conditions: Then there exists a constant $c>0$ (depending on $\gamma$ and the horizon) such that Moreover, because $\widehat{\pi}$ queries a fixed Q-function from a static dataset, the variance of its gradient estimates can be significantly low
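The objective $\mathcal{J}(\pi)$ can be made concrete with a toy example. The sketch below assumes a tabular softmax policy, a small discrete action set, and made-up frozen Q-values (`FROZEN_Q` and `DATASET_STATES` are hypothetical). It uses plain exact gradient ascent rather than the PPO procedure mentioned in the TL;DR, so it illustrates the objective only, not the paper's training loop. Because $Q_\theta$ is fixed and the states come from a static dataset, the expectation over actions is computed in closed form here, with no environment rollouts.

```python
import math
from typing import Dict, List

# Hypothetical frozen Q-values: state -> value of each discrete action.
FROZEN_Q: Dict[str, List[float]] = {
    "s0": [0.0, 1.0],
    "s1": [1.0, 0.0],
}
# States drawn from the static dataset D (repeats would be allowed).
DATASET_STATES: List[str] = ["s0", "s1"]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def objective(logits: Dict[str, List[float]], q: Dict[str, List[float]],
              states: List[str]) -> float:
    """J(pi) = mean over dataset states of sum_a pi(a|s) * Q(s, a)."""
    total = 0.0
    for s in states:
        pi = softmax(logits[s])
        total += sum(p * qv for p, qv in zip(pi, q[s]))
    return total / len(states)


def ascent_step(logits: Dict[str, List[float]], q: Dict[str, List[float]],
                states: List[str], lr: float) -> None:
    """One exact gradient-ascent step on J; Q is only read, never updated."""
    grads = {s: [0.0] * len(v) for s, v in logits.items()}
    for s in states:
        pi = softmax(logits[s])
        expected_q = sum(p * qv for p, qv in zip(pi, q[s]))
        for a, (p, qv) in enumerate(zip(pi, q[s])):
            # d/d logit_a of sum_b pi_b * Q_b  =  pi_a * (Q_a - E_pi[Q])
            grads[s][a] += p * (qv - expected_q) / len(states)
    for s in logits:
        for a in range(len(logits[s])):
            logits[s][a] += lr * grads[s][a]


def train(steps: int = 200, lr: float = 1.0) -> Dict[str, List[float]]:
    logits = {s: [0.0] * len(v) for s, v in FROZEN_Q.items()}  # uniform start
    for _ in range(steps):
        ascent_step(logits, FROZEN_Q, DATASET_STATES, lr)
    return logits
```

Starting from a uniform policy, `objective` is 0.5 on this toy data and rises toward 1.0 as `train` shifts probability mass to the high-value action in each state.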

Figures (9)

  • Figure 1: Two GUI tasks with the action marked as a red dot ($\bullet$). Although we do not see the next state after taking the action, we can estimate the state-action value. In (a), clicking the 'share' button is unlikely to reveal reviews, implying a low state-action value. In (b), the action may open the calendar app, making it a plausible step to display the week's events, suggesting a high state-action value.
  • Figure 2: VEM Architecture: (1) Offline dataset annotation using GPT-4o's task understanding, and VEM training via supervised regression. (2) Policy optimization through frozen VEM maximization, encouraging the policy model to explore high-value actions.
  • Figure 3: Action space exploration patterns demonstrated by the value model during policy training.
  • Figure 4: Q-value loss progression during policy model training.
  • Figure 5: A case study of task execution trajectory comparison with DigiRL.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Theorem 3.1: Extended Performance Bound
  • Proof