Can Language Models Serve as Text-Based World Simulators?

Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, Peter Jansen

TL;DR

The paper investigates whether large language models can serve as text-based world simulators. It introduces ByteSized32-State-Prediction, a dataset of $76{,}369$ transitions across $31$ text games, and formalizes the LLM-Sim task as single-step prediction of the next state, reward, and game completion. It decomposes the world model into action-driven, environment-driven, and reward components and evaluates two output modes, Full State Prediction and State Difference Prediction, with and without in-context rules. Experiments with GPT-4 show that single-step accuracy tops out around $59.9\%$, that action-driven transitions are easier (up to $77.1\%$) than environment-driven transitions (up to $49.7\%$), and that humans still outperform the model by a notable margin. Progress hinges on better handling of the arithmetic, commonsense, and domain knowledge that environmental dynamics require. The results demonstrate that while LLMs hold promise as world simulators, they are not yet reliable without further innovations, and the ByteSized32-SP benchmark provides a concrete, scalable avenue to track future improvements.

Abstract

Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called ByteSized32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLMs' capabilities and weaknesses and a novel benchmark to track future progress as new models appear.

Paper Structure (32 sections, 1 equation, 6 figures, 7 tables)

Figures (6)

  • Figure 1: An overview of our two approaches using an LLM as a text game simulator. The example shows a cup in the sink being filled with water after the sink is turned on. The full state prediction includes all objects in the game, including the unrelated stove, while the state difference prediction excludes the unrelated stove. State changes caused by $\mathcal{F}_{\text{act}}$ and $\mathcal{F}_{\text{env}}$ are highlighted in yellow and green, respectively.
  • Figure 2: Simulation performance on whole state transitions (top), action-driven transitions (middle), and environment-driven transitions (bottom) as a function of the property being modified, in the GPT-4, full state prediction, human-written rules condition. The x-axis represents specific object properties, and the y-axis represents performance (0-100%). Errors are broken down into incorrect value and unaltered value. Refer to Table \ref{tab: property description} for the meaning of each property.
  • Figure 3: GPT-4 - Full State prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.
  • Figure 4: GPT-4 - Difference prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.
  • Figure 5: GPT-3.5 - Full State prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.
  • ...and 1 more figure
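
The distinction that Figure 1 draws between full state prediction and state difference prediction can be sketched in a few lines of Python. This is only an illustrative sketch, not the benchmark's actual data format: the object names, the property names (`isOn`, `containsWater`, `location`), and the helper functions `state_diff` and `apply_diff` are all hypothetical. A game state is assumed to be a mapping from object names to property dictionaries.

```python
from copy import deepcopy

_MISSING = object()  # sentinel so that a property set to None still counts as a change


def state_diff(before, after):
    """Return only the object properties whose values differ between two states.

    Simplification (assumption): objects or properties that are removed
    entirely are not represented in the diff.
    """
    diff = {}
    for obj, props in after.items():
        old_props = before.get(obj, {})
        changed = {
            key: value
            for key, value in props.items()
            if old_props.get(key, _MISSING) != value
        }
        if changed:
            diff[obj] = changed
    return diff


def apply_diff(before, diff):
    """Rebuild the full next state from the previous state plus a diff."""
    result = deepcopy(before)
    for obj, props in diff.items():
        result.setdefault(obj, {}).update(props)
    return result


# Hypothetical state before the action "turn on sink".
before = {
    "sink": {"isOn": False},
    "cup": {"location": "sink", "containsWater": False},
    "stove": {"isOn": False},
}

# Full state prediction: every object is listed, including the unrelated stove.
full_next_state = {
    "sink": {"isOn": True},                              # action-driven change
    "cup": {"location": "sink", "containsWater": True},  # environment-driven change
    "stove": {"isOn": False},                            # unchanged
}

# State difference prediction: only the changed properties are listed.
difference = state_diff(before, full_next_state)
print(difference)
```

Under these assumptions the difference mentions only the sink and the cup, and applying it to the previous state with `apply_diff` recovers the full next state, so the two output modes carry the same information in different amounts of text.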