UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics

Mengzhou Wu, Yuzhe Guo, Yuan Cao, Haochuan Lu, Songhe Zhu, Pingzhe Qu, Xin Chen, Kang Qin, Zhongpu Wang, Xiaode Zhang, Xinyi Wang, Wei Dai, Gang Cao, Yuetang Deng, Zhi Gong, Dezhi Ran, Linyi Li, Wei Yang, Tao Xie

Abstract

Scaling generalist GUI agents is hindered by the data scalability bottleneck of expensive human demonstrations and the "distillation ceiling" of synthetic teacher supervision. To transcend these limitations, we propose UI-Oceanus, a framework that shifts the learning focus from mimicking high-level trajectories to mastering interaction physics via ground-truth environmental feedback. Through a systematic investigation of self-supervised objectives, we identify forward dynamics, defined as the generative prediction of future interface states, as the primary driver of scalability, significantly outweighing inverse inference. UI-Oceanus leverages this insight by converting low-cost autonomous exploration, verified directly by system execution, into high-density generative supervision for constructing a robust internal world model. Experimental evaluations across a series of models demonstrate the decisive superiority of our approach: models that undergo Continual Pre-Training (CPT) on synthetic dynamics outperform non-CPT baselines by an average of 7% in success rate on offline benchmarks, a gain that amplifies to 16.8% in real-world online navigation. Furthermore, navigation performance scales with synthetic data volume. These results confirm that grounding agents in forward predictive modeling offers a superior pathway to scalable GUI automation with robust cross-domain adaptability and compositional generalization.
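To make the forward-versus-inverse distinction concrete, the following minimal Python sketch shows how a single explored transition (state, action, next state) can be turned into either kind of supervision pair. All names here (Transition, make_forward_example, make_inverse_example) are hypothetical illustrations, not the paper's released code or data format.

```python
# Hypothetical sketch: building forward- vs inverse-dynamics supervision
# from one system-verified GUI transition (s_t, a_t, s_{t+1}).
from dataclasses import dataclass


@dataclass
class Transition:
    state_before: str   # serialized UI state at step t (e.g., screen description)
    action: str         # executed action, e.g., "click(node='Wi-Fi toggle')"
    state_after: str    # serialized UI state at step t+1, verified by execution


def make_forward_example(t: Transition) -> dict:
    """Forward dynamics: generatively predict the next interface state."""
    return {
        "prompt": (f"Current UI state:\n{t.state_before}\n"
                   f"Executed action: {t.action}\n"
                   f"Describe the resulting UI state:"),
        "target": t.state_after,  # dense, high-entropy generative target
    }


def make_inverse_example(t: Transition) -> dict:
    """Inverse dynamics: infer which action caused an observed change."""
    return {
        "prompt": (f"UI state before:\n{t.state_before}\n"
                   f"UI state after:\n{t.state_after}\n"
                   f"Which action caused this transition?"),
        "target": t.action,  # short, low-entropy target
    }


if __name__ == "__main__":
    t = Transition(
        state_before="Settings screen with a 'Wi-Fi' toggle set to OFF",
        action="click(node='Wi-Fi toggle')",
        state_after="Settings screen with 'Wi-Fi' ON and a network scan list visible",
    )
    print(make_forward_example(t)["target"])
    print(make_inverse_example(t)["target"])
```

Under this framing, the forward-dynamics target is a full description of the next screen, while the inverse-dynamics target is a single action string, which is one intuition for why the former provides the denser supervision emphasized in the abstract.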


Figures (4)

  • Figure 1: Constructing Generalist GUI Agents via Scalable World Model Learning. (Top) We first establish a robust physical foundation by learning a forward dynamics world model from massive, autonomously explored transitions. (Bottom) We then leverage this internalized world model to instantiate a generalist GUI agent through agentic post-training.
  • Figure 2: Overview of the proposed UI-Oceanus framework. UI-Oceanus consists of four sequential stages: (1) Scalable Acquisition, which autonomously explores diverse GUI applications to generate large-scale raw interaction trajectories; (2) Multi-Step Data Filtering Pipeline, which systematically filters and deduplicates raw interactions based on structural, visual, and semantic criteria; (3) Grounded Instruction Generation, which synthesizes multimodal instructions by interpreting transitions grounded in actual environmental feedback; and (4) Training Implementation, which employs forward dynamics for continual pre-training of the world model, followed by agentic post-training to finalize the GUI agent.
  • Figure 3: Scaling behavior of Qwen3-VL series models.
  • Figure 4: Training Loss Comparison. Inverse Dynamics (orange) exhibits rapid saturation, indicating insufficient task difficulty. In contrast, Forward Dynamics (blue) maintains a higher loss level, providing the sustained gradient signal necessary for effective representation learning.
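The multi-step filtering stage summarized in the Figure 2 caption (structural, visual, and semantic criteria) can be sketched as a simple pass over raw transitions. The sketch below is an assumption-laden illustration of that idea, not the released pipeline: a real system would likely use perceptual-hash distances and a VLM-based semantic check where the placeholders appear.

```python
# Hypothetical sketch of structural / visual / semantic filtering of raw
# explored transitions, in the spirit of Figure 2's filtering stage.
import hashlib


def structural_key(ui_tree: str) -> str:
    """Collapse transitions that land on structurally identical screens."""
    return hashlib.sha256(ui_tree.encode("utf-8")).hexdigest()


def visually_changed(before: bytes, after: bytes) -> bool:
    """Placeholder visual check: keep only transitions whose pixels changed.
    A real pipeline would compare perceptual hashes against a threshold."""
    return before != after


def semantically_meaningful(transition: dict) -> bool:
    """Placeholder semantic check: drop no-op or crashed transitions.
    A real pipeline might ask a VLM whether the effect matches the action."""
    return transition.get("action") not in (None, "noop") and not transition.get("crashed", False)


def filter_transitions(raw: list[dict]) -> list[dict]:
    seen: set[str] = set()
    kept: list[dict] = []
    for t in raw:
        key = structural_key(t["ui_tree_after"])
        if key in seen:
            continue  # structural deduplication
        if not visually_changed(t["screenshot_before"], t["screenshot_after"]):
            continue  # visual filtering
        if not semantically_meaningful(t):
            continue  # semantic filtering
        seen.add(key)
        kept.append(t)
    return kept
```

Only transitions surviving all three checks would proceed to grounded instruction generation and forward-dynamics continual pre-training.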