Code2World: A GUI World Model via Renderable Code Generation

Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin

TL;DR

Code2World introduces a renderable-code GUI world model that predicts the next UI state by generating HTML and rendering it, addressing data scarcity with the AndroidCode dataset of over 80K samples refined through visual feedback. It trains in two stages—supervised fine-tuning for HTML syntax followed by Render-Aware Reinforcement Learning with dual rewards $R_{sem}$ and $R_{act}$—and evaluates via a VLM-as-judge framework, achieving state-of-the-art next UI prediction and significant downstream navigation gains. The work demonstrates strong generalization to unseen apps and offers a plug-and-play simulator that enhances offline and online GUI navigation, with code released to enable broader adoption. Overall, Code2World shifts GUI world modeling from pixel or pure semantic representations to renderable code, providing both visual fidelity and fine-grained structural controllability for safer, foresighted agent planning.
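
To make the dual-reward idea concrete, here is a minimal sketch of how $R_{sem}$ and $R_{act}$ could feed a group-relative advantage of the kind GRPO uses (GRPO is the optimizer named in the Figure 2 caption). The equal weights, the reward values, and the function names are assumptions for illustration, not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(r_sem: float, r_act: float,
                    w_sem: float = 0.5, w_act: float = 0.5) -> float:
    """Weighted sum of the two render-derived rewards.

    The 0.5/0.5 weighting is an assumption; the summary does not state
    how R_sem and R_act are combined.
    """
    return w_sem * r_sem + w_act * r_act


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardise each reward against the other rollouts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical HTML rollouts for one (screenshot, action) prompt,
# each scored as (R_sem, R_act) after rendering. Values are made up.
group = [(0.9, 1.0), (0.6, 0.0), (0.8, 1.0), (0.3, 0.0)]
rewards = [combined_reward(sem, act) for sem, act in group]
advantages = group_relative_advantages(rewards)
```

Rollouts that render better than their group's average receive a positive advantage, and the rest receive a negative one, so no separate value network is needed to provide a baseline.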

Abstract

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, a GUI world model gives agents human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to achieve high visual fidelity and fine-grained structural controllability at the same time. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining the synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform SFT as a cold start for following the output format and layout, then apply Render-Aware Reinforcement Learning, which uses the rendered outcome as the reward signal to enforce visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves top-performing next-UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP-ML/Code2World.
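
The visual-feedback revision mechanism described above can be sketched as a render, compare, revise loop. This is an illustrative sketch under stated assumptions rather than the authors' pipeline: `render`, `similarity`, and `revise` are hypothetical callables standing in for a headless browser, an image-similarity model, and a code-revising model. The 0.9 threshold follows the SigLIP score quoted in the Figure 2 caption, while the round limit is invented.

```python
from typing import Callable, Optional


def revise_until_aligned(
    html: str,
    target: object,
    render: Callable[[str], object],
    similarity: Callable[[object, object], float],
    revise: Callable[[str, object, object], str],
    threshold: float = 0.9,   # alignment bar quoted in the Figure 2 caption
    max_rounds: int = 3,      # assumed budget, not from the paper
) -> Optional[str]:
    """Return HTML whose rendering matches the target, or None if it never does."""
    for round_idx in range(max_rounds + 1):
        rendered = render(html)
        if similarity(rendered, target) > threshold:
            return html  # aligned closely enough to keep as a training sample
        if round_idx < max_rounds:
            # Feed the visual discrepancy back to the reviser.
            html = revise(html, rendered, target)
    return None  # discard samples that never reach the threshold


# Toy stand-ins so the sketch runs end to end: "rendering" is the string
# length, and similarity is how close that length is to a target length.
toy_render = lambda html: len(html)
toy_similarity = lambda rendered, target: min(rendered / target, 1.0)
toy_revise = lambda html, rendered, target: html + "x"

accepted = revise_until_aligned("ab", 4, toy_render, toy_similarity, toy_revise)
rejected = revise_until_aligned("ab", 4, toy_render, toy_similarity, toy_revise,
                                max_rounds=1)
```

Returning `None` for samples that never pass the threshold is one possible filtering policy. The summary does not say what happens to samples that fail to converge.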

Paper Structure

This paper contains 39 sections, 11 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of Code2World. Given a current GUI observation and an action, Code2World predicts the next screenshot via renderable code generation.
  • Figure 2: Left: Illustration of Data Synthesis. The high-fidelity AndroidCode dataset is curated via constrained initial synthesis and a visual-feedback revision loop, where synthesized HTML is iteratively refined based on rendered visual discrepancies to ensure strict alignment (SigLIP score $> 0.9$). Right: Two-stage Model Optimization. The pipeline progresses from an SFT cold start to Render-Aware Reinforcement Learning (RARL). Using Group Relative Policy Optimization (GRPO), the model optimizes dual rewards, visual semantic ($R_{\text{sem}}$) and action consistency ($R_{\text{act}}$), derived directly from rendered outcomes to enforce structural and logical fidelity.
  • Figure 3: Illustration of the "Propose, Simulate, Select" pipeline for a Code2World-enhanced GUI agent, exemplified by an AndroidWorld task. (1) Propose: The GUI agent generates $K$ candidate actions, with red highlighting hallucinated or irrational reasoning and green highlighting logically sound reasoning. (2) Simulate: Code2World predicts the execution result of each candidate via renderable code generation. (3) Select: By evaluating the rendered future states, the system identifies the potential failure in the original policy and rectifies the decision, ultimately selecting the optimal action that aligns with the user's intent.
  • Figure 4: Qualitative comparison of next GUI state generation for Code2World and three baselines. The red circle in the original state indicates the user's click position targeting the search bar.
  • Figure 5: Performance comparison on AndroidWorld.
  • ...and 5 more figures
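
The "Propose, Simulate, Select" pipeline in the Figure 3 caption can be sketched as a small selection routine. This is a sketch under assumptions, not the released implementation: `propose`, `simulate`, and `score` are hypothetical callables standing in for the GUI agent, the world model (Code2World in the paper), and whatever evaluator judges the rendered future states. The default `k` is also invented.

```python
from typing import Callable, Sequence, TypeVar

Action = TypeVar("Action")
State = TypeVar("State")


def propose_simulate_select(
    observation: State,
    goal: str,
    propose: Callable[[State, str, int], Sequence[Action]],
    simulate: Callable[[State, Action], State],
    score: Callable[[State, str], float],
    k: int = 3,  # number of candidates; the caption only calls it K
) -> Action:
    """Pick the candidate action whose predicted next state best fits the goal."""
    # (1) Propose: the agent suggests up to k candidate actions.
    candidates = propose(observation, goal, k)
    if not candidates:
        raise ValueError("the agent proposed no candidate actions")
    # (2) Simulate: the world model predicts the next state for each candidate.
    predicted = [simulate(observation, action) for action in candidates]
    # (3) Select: score every predicted state against the goal, keep the best.
    scores = [score(state, goal) for state in predicted]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Toy stand-ins: states are strings and the scorer is a keyword check.
toy_propose = lambda obs, goal, k: ["tap_back", "tap_search", "scroll_down"][:k]
toy_simulate = lambda obs, action: f"{obs} after {action}"
toy_score = lambda state, goal: 1.0 if "search" in state else 0.0

chosen = propose_simulate_select(
    "home screen", "search for a contact", toy_propose, toy_simulate, toy_score
)
```

Because the world model is only consulted through `simulate`, the same routine can wrap any agent that proposes several candidates, which matches the plug-and-play use the TL;DR describes.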