What-If Analysis of Large Language Models: Explore the Game World Using Proactive Thinking

Yuan Sui, Yanming Zhang, Yi Liao, Yu Gu, Guohua Tang, Zhongqian Sun, Wei Yang, Bryan Hooi

TL;DR

WiA-LLM introduces What-If Analysis for large language models by building an explicit, language-based world model to forecast the consequences of actions in a MOBA environment. The approach combines supervised fine-tuning on reasoning traces with reinforcement learning using rule-based, verifiable rewards to align forecasts with real dynamics. A lookahead-based inference mechanism enables model-based planning, improving strategic behavior and forecasting accuracy in Honor of Kings. Experiments demonstrate strong performance across varying difficulty levels and show that proactive forecasting enhances forward-looking decision-making while preserving core language capabilities. The work advances interpretable, generalizable planning for LLMs in dynamic, partially observable environments, with practical deployment considerations discussed.
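
The lookahead idea in the summary above can be illustrated with a minimal sketch: forecast the outcome of each candidate action, score the forecast, and pick the best. Everything named here is an assumption made for illustration. `toy_world_model`, `toy_score`, the action names, and the numeric state fields are hypothetical stand-ins, whereas the world model described in the paper is an LLM that forecasts state changes in language.

```python
from typing import Callable, Dict, List

State = Dict[str, float]


def lookahead_select(
    state: State,
    actions: List[str],
    world_model: Callable[[State, str], State],
    score: Callable[[State], float],
) -> str:
    """Return the action whose forecast next state scores highest."""
    return max(actions, key=lambda action: score(world_model(state, action)))


def toy_world_model(state: State, action: str) -> State:
    """Hypothetical stand-in for the learned world model (fixed effects per action)."""
    effects = {
        "push_lane": {"gold": 50.0, "hp": -20.0},
        "retreat": {"gold": 0.0, "hp": 10.0},
        "fight": {"gold": 80.0, "hp": -70.0},
    }
    delta = effects[action]
    return {key: value + delta.get(key, 0.0) for key, value in state.items()}


def toy_score(state: State) -> float:
    """Hypothetical heuristic: value gold, and weight remaining health twice as much."""
    return state["gold"] + 2.0 * state["hp"]


state = {"gold": 100.0, "hp": 60.0}
chosen = lookahead_select(
    state, ["push_lane", "retreat", "fight"], toy_world_model, toy_score
)
print(chosen)
```

With these toy numbers the forecast scores are 230 for `push_lane`, 240 for `retreat`, and 160 for `fight`, so the sketch selects `retreat`. A reactive agent would choose from the current state alone, without the forecasting step.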

Abstract

Large Language Models (LLMs) are effective at reasoning and information retrieval, but remain unreliable for decision-making in dynamic, partially observable, high-stakes environments such as MOBA games. One key limitation is weak counterfactual reasoning: LLMs struggle to conduct precise what-if analysis over candidate actions and their future consequences. We address this limitation with What-if Analysis LLM (WiA-LLM), a framework that trains an LLM as an explicit language-based world model. Instead of representing the environment in latent vectors, WiA-LLM uses language to model how the game state evolves over time under candidate actions, and provides textual justifications for the predicted outcomes. This explicit modeling supports (1) interpretability, since the model's predictions and underlying rationales are human-readable, and (2) semantic generalization, as the model can transfer knowledge across situations that share similar game concepts (e.g., roles, objectives, or tactics). WiA-LLM is trained in two stages: supervised fine-tuning on human-like reasoning traces, followed by reinforcement learning with outcome-based rewards that depend on the discrepancy between predicted and ground-truth future states. In the Honor of Kings (HoK) environment, WiA-LLM attains 74.2% accuracy (27% ↑ vs. base model) in forecasting game-state changes. In addition, we find that agents with WiA-LLM exhibit strategic behavior closer to that of expert players than purely reactive LLM agents do, indicating more foresight-aware and expert-aligned decision-making.
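
To make the outcome-based reward concrete, here is a minimal sketch of one way a rule-based verifier could score a forecast against the ground truth. It is an assumption for illustration, not the paper's reward definition: the function `forecast_reward`, the field names, and the exact-match rule are all hypothetical.

```python
from typing import Dict


def forecast_reward(predicted: Dict[str, str], actual: Dict[str, str]) -> float:
    """Fraction of ground-truth state fields whose predicted change matches exactly.

    Hypothetical rule: one point per exactly matching field, normalized to [0, 1].
    """
    if not actual:
        return 0.0
    correct = sum(1 for key, value in actual.items() if predicted.get(key) == value)
    return correct / len(actual)


# Hypothetical forecast and ground truth for a single action.
predicted = {"hero_hp": "decrease", "gold": "increase", "tower_hp": "unchanged"}
actual = {"hero_hp": "decrease", "gold": "increase", "tower_hp": "decrease"}

reward = forecast_reward(predicted, actual)
print(round(reward, 3))  # 2 of the 3 fields match
```

Because the reward is computed by a fixed rule rather than by a learned judge, it is verifiable: the same forecast and ground truth always yield the same score. A smaller discrepancy between forecast and ground truth yields a higher reward.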

Paper Structure

This paper contains 30 sections, 6 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of reasoning paradigms: (a) reactive thinking, where the model selects an action given the current game state; (b) proactive thinking, where the model also forecasts the consequences of candidate actions on future game states. In this work, we focus on proactive thinking and train models to forecast the consequences of different actions.
  • Figure 2: Workflow of WiA-LLM. Given the current game state and a set of hypothetical actions, the model forecasts how the entire game state would change once the player takes each action, and justifies its forecasts. A rule-based verifier then compares the predicted game-state changes with ground-truth values, and the result is used to update the policy model. This process enables the model to perform what-if analysis (forecasting) by simulating action outcomes and refining its decision-making accordingly.
  • Figure 3: Rewards (left) and total token length (right) over training steps during RL training. Models trained with our method consistently achieve higher rewards and maintain more stable or longer token lengths than the baselines, suggesting improved learning efficiency and output quality.
  • Figure 4: Distribution of sample counts across accuracy ranges. The stacked histograms show the number of samples in each accuracy range, while the smoothed trend line highlights the performance pattern of our method.
  • Figure 5: Case study on WiA. To safeguard user privacy, we blurred the user's Game ID.
  • ...and 2 more figures