VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
TL;DR
The paper tackles the cost and instability of environment-based RL for GUI agents and the generalization gaps of environment-free methods. It introduces the Value Environment Model (VEM), a two-stage offline framework that first learns a value function $Q_\theta(s,a)$ from human-guided annotations, then freezes it and uses this fixed value signal to guide policy optimization via PPO, avoiding environment interactions. On Android-in-the-Wild, the approach achieves state-of-the-art results among environment-free methods and matches environment-based performance without the environment-interaction cost. Theoretical analysis shows near-optimality under reasonable coverage and approximation conditions, and experiments demonstrate improved training stability and robust generalization across GUI layouts. Overall, semantic, value-driven estimation enables efficient, layout-agnostic GUI automation at scale.
Abstract
Training Vision-Language Models (VLMs) as Graphical User Interface (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., "Does this action advance the user's goal?"). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, significantly outperforming environment-free baselines and matching environment-based approaches without interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve performance comparable to that of online-trained methods.
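
To make the two stages described above concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes a tabular setting with two states and three actions in place of a VLM, hypothetical annotation labels (2.0 for a helpful action, 1.0 otherwise), and an exact softmax policy-gradient update in place of the PPO step mentioned in the TL;DR. All names (`OFFLINE_DATA`, `pretrain_value_model`, `optimize_policy`) are illustrative.

```python
import math

# Hypothetical offline data: (state, action, annotated value label).
# The labels stand in for human-guided annotations of action utility.
OFFLINE_DATA = [
    (0, 0, 1.0), (0, 1, 2.0), (0, 2, 1.0),
    (1, 0, 2.0), (1, 1, 1.0), (1, 2, 1.0),
]
N_STATES, N_ACTIONS = 2, 3


def pretrain_value_model(data, epochs=200, lr=0.1):
    """Stage 1: regress Q(s, a) onto the annotated labels (squared error)."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(epochs):
        for s, a, label in data:
            q[s][a] += lr * (label - q[s][a])  # SGD step on 0.5 * (label - q)^2
    return q


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def optimize_policy(q_frozen, steps=500, lr=0.5):
    """Stage 2: ascend E_{a ~ pi}[Q(s, a)] using only the frozen Q values.

    No environment is queried: the value table is the sole training signal.
    """
    logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(steps):
        for s in range(N_STATES):
            pi = softmax(logits[s])
            expected = sum(p * v for p, v in zip(pi, q_frozen[s]))
            for a in range(N_ACTIONS):
                # Exact gradient of the expected value w.r.t. logit a.
                logits[s][a] += lr * pi[a] * (q_frozen[s][a] - expected)
    return logits


q = pretrain_value_model(OFFLINE_DATA)   # Stage 1, then treat q as frozen
policy_logits = optimize_policy(q)       # Stage 2
greedy = [max(range(N_ACTIONS), key=lambda a: policy_logits[s][a])
          for s in range(N_STATES)]
```

In this toy, the policy concentrates on whichever action the frozen value table scores highest in each state, which is the sense in which the value model replaces environment feedback. A real system would replace both tables with learned models over screenshots and task instructions.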
