ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World

Runliang Niu, Jinglong Ji, Yi Chang, Qi Wang

TL;DR

ScreenExplorer presents a vision-language agent trained with Group Relative Policy Optimization in real, dynamic GUI environments. It couples a world-model-based curiosity reward with GRPO and an experience stream distillation pipeline to enable both effective interaction and diverse exploration, addressing cold-start and data-efficiency challenges. The approach demonstrates improved exploration diversity and GUI adaptation for a 3B-parameter model, and its distillation loop offers a path toward continual self-improvement with reduced reliance on manually curated data. Overall, the work provides a scalable framework for self-improving, open-world GUI agents with potential implications for advancing toward AGI in interactive settings.

Abstract

The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision-language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments. We introduce a world-model-based curiosity reward function to help the agent overcome the cold-start phase of exploration. Additionally, distilling experience streams further enhances the model's exploration capabilities. Our training framework improves exploration in open GUI environments: trained models adapt better to the environment and sustain exploration longer than statically deployed models. Our findings offer a scalable pathway toward AGI systems with self-improving capabilities in complex interactive settings.

Paper Structure

This paper contains 32 sections, 6 equations, 30 figures, 6 tables, 1 algorithm.

Figures (30)

  • Figure 1: ScreenExplorer-3B-E1's RL training leads to better GUI exploration diversity versus static models.
  • Figure 2: Framework overview: (a) We run $M$ parallel environments for $T$ steps per episode. At each step, the VLM takes state $s$ and outputs an intent $i$ and action $a$, the environment returns the post-action state $s'$, and the world model predicts the next state $\hat{s}$. All transitions are stored in a rollout buffer, where a reward function computes an exploration reward for each action. The VLM is then updated via GRPO, while the world model learns transitions by minimizing reconstruction error; (b) The reward function consists of nine terms that enforce correct action formatting, encourage large state changes, and align intents with observed states.
  • Figure 3: Examples of Trajectories from ScreenExplorer-3B-E1. Through RL training, the model developed increasingly effective interactions with the environment, enabling exploration of deeper pages.
  • Figure 4: Indicators of ScreenExplorer-3B-E1 in training. During RL training, the VLM actor's increasing rewards show improved environment interaction and state space exploration. The world model loss demonstrates sustained curiosity that drives further exploration.
  • Figure 5: Rewards and metrics of exploration diversity in world model ablation study.
  • ...and 25 more figures
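
The loop described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: states are represented as plain float vectors (hypothetical embeddings of $s'$ and $\hat{s}$), the curiosity term is taken to be the world model's mean squared prediction error (a common construction that this summary does not confirm), and only that one term is shown rather than the nine reward terms the caption mentions. The function names `curiosity_reward` and `group_relative_advantages` are illustrative. The second function shows the group-relative normalization that gives GRPO its name: each reward is standardized against the other rewards in its group.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def curiosity_reward(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Assumed curiosity term: mean squared error between the world model's
    predicted next state (s_hat) and the observed post-action state (s')."""
    if len(predicted) != len(observed):
        raise ValueError("state vectors must have the same length")
    if not predicted:
        raise ValueError("state vectors must be non-empty")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the mean and standard deviation of its
    group, as in GRPO-style advantage estimation. eps avoids division by zero
    when every reward in the group is identical."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of three transitions from the same starting state:
    # (world-model prediction, observed next state) as toy 2-d embeddings.
    group = [
        ([0.0, 0.0], [0.0, 0.0]),  # perfectly predicted, nothing novel
        ([0.0, 0.0], [1.0, 1.0]),  # moderately surprising
        ([0.0, 0.0], [2.0, 2.0]),  # most surprising
    ]
    rewards = [curiosity_reward(pred, obs) for pred, obs in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under these assumptions, transitions the world model predicts poorly receive a higher reward, so actions that reach unfamiliar screens get positive advantages relative to the rest of their group, while actions that leave the screen unchanged get negative ones.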