World Models for Policy Refinement in StarCraft II

Yixin Zhang, Ziyi Wang, Yiming Rong, Haoxi Wang, Jinling Jiang, Shuang Xu, Haoran Wu, Shiyu Zhou, Bo Xu

TL;DR

StarWM is the first action-conditioned world model for StarCraft II, enabling lookahead under partial observability. It uses a structured five-module textual observation representation and comes with SC2-Dynamics-50k, the first instruction-tuning dataset for SC2 dynamics prediction, plus a multi-dimensional offline evaluation framework. Integrating StarWM into a Generate--Simulate--Refine loop as StarWM-Agent yields consistent online gains against the built-in AI across multiple difficulty levels and improves macro-management stability and tactical risk assessment. The work demonstrates that learnable world models can supply foresight for policy refinement in complex RTS environments, bridging LLM-based decision making and forward simulation.

Abstract

Large Language Models (LLMs) have recently shown strong reasoning and generalization capabilities, motivating their use as decision-making policies in complex environments. StarCraft II (SC2), with its massive state-action space and partial observability, is a challenging testbed. However, existing LLM-based SC2 agents primarily focus on improving the policy itself and overlook integrating a learnable, action-conditioned transition model into the decision loop. To bridge this gap, we propose StarWM, the first world model for SC2 that predicts future observations under partial observability. To facilitate learning SC2's hybrid dynamics, we introduce a structured textual representation that factorizes observations into five semantic modules, and construct SC2-Dynamics-50k, the first instruction-tuning dataset for SC2 dynamics prediction. We further develop a multi-dimensional offline evaluation framework for predicted structured observations. Offline results show StarWM's substantial gains over zero-shot baselines, including nearly 60% improvements in resource prediction accuracy and self-side macro-situation consistency. Finally, we propose StarWM-Agent, a world-model-augmented decision system that integrates StarWM into a Generate--Simulate--Refine decision loop for foresight-driven policy refinement. Online evaluation against SC2's built-in AI demonstrates consistent improvements, yielding win-rate gains of 30%, 15%, and 30% against Hard (LV5), Harder (LV6), and VeryHard (LV7), respectively, alongside improved macro-management stability and tactical risk assessment.

Paper Structure

This paper contains 59 sections, 11 equations, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Case study comparing our world-model-augmented decision system (StarWM-Agent) with a policy that does not use a world model. Given the current observation, the LLM policy initially proposes Build Supply Depot. A 5-second rollout by the world model predicts that minerals will drop to 50 and the supply depot will be 23% complete, while unused supply remains 18. Based on this prediction, the system revises the action to Train SCV, avoiding premature infrastructure expenditure that would lead to mineral shortage. This example illustrates that incorporating a world model can improve macro-management decision-making.
  • Figure 2: Framework of our StarWM-Agent, which follows a Generate--Simulate--Refine loop: the policy first generates an initial action proposal from the current observation, the world model predicts the short-horizon future observation, and the policy then refines the action conditioned on the predicted future.
  • Figure 3: Evolution of Macro-Situation Metric (AWD) over game time. Left: Self-side entities. Right: Enemy-side entities. The green area indicates where StarWM outperforms the zero-shot Qwen3-32B baseline.
  • Figure 4: Offline case study. Left: Qwen3-8B. Middle: Qwen3-32B. Right: StarWM. Circles and squares denote units and structures, respectively. Filled markers indicate ground truth, while hollow markers represent predictions. StarWM exhibits stronger spatial consistency with the ground truth, reflecting more accurate action-conditioned movement prediction.
  • Figure 5: Offline case study on scouting. When self units enter unobservable regions, StarWM predicts potential enemy presence (red hollow circles) within those areas, illustrating a data-driven statistical predictive pattern.
  • ...and 1 more figure
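
The Generate--Simulate--Refine loop described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Observation` dictionary, the `Decision` record, and the three callables (`propose`, `simulate`, `refine`) are hypothetical names, and the toy stand-ins below are hand-written rules rather than an LLM policy or the StarWM model. The toy values mirror the Figure 1 case study (minerals predicted to drop to 50, depot 23% complete, unused supply 18); the starting mineral count is made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical flat stand-in for the paper's structured textual observation.
Observation = Dict[str, float]


@dataclass
class Decision:
    proposal: str           # action first suggested by the policy
    predicted: Observation  # short-horizon future predicted by the world model
    action: str             # action kept or revised after seeing the prediction


def generate_simulate_refine(
    obs: Observation,
    propose: Callable[[Observation], str],
    simulate: Callable[[Observation, str], Observation],
    refine: Callable[[Observation, str, Observation], str],
) -> Decision:
    """Run one pass of the loop: propose, roll out, then refine."""
    proposal = propose(obs)                    # Generate
    predicted = simulate(obs, proposal)        # Simulate (action-conditioned)
    action = refine(obs, proposal, predicted)  # Refine
    return Decision(proposal, predicted, action)


# --- Toy stand-ins (assumptions, used only to exercise the loop) ---

def toy_propose(obs: Observation) -> str:
    return "Build Supply Depot"


def toy_simulate(obs: Observation, action: str) -> Observation:
    predicted = dict(obs)
    if action == "Build Supply Depot":
        # Values taken from the Figure 1 caption, hard-coded here.
        predicted["minerals"] = 50
        predicted["supply_depot_progress"] = 0.23
    return predicted


def toy_refine(obs: Observation, proposal: str, predicted: Observation) -> str:
    # Hypothetical rule: with plenty of unused supply and minerals about to
    # run low, prefer a worker over more supply infrastructure.
    if (
        proposal == "Build Supply Depot"
        and predicted["unused_supply"] >= 10
        and predicted["minerals"] <= 50
    ):
        return "Train SCV"
    return proposal


obs = {"minerals": 120, "unused_supply": 18}  # starting minerals are illustrative
decision = generate_simulate_refine(obs, toy_propose, toy_simulate, toy_refine)
print(decision.proposal, "->", decision.action)
```

Passing the three stages as callables keeps the loop independent of how the policy and world model are realized; in the paper's setting both would be language-model calls over textual observations rather than the rule-based stubs used here.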