Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search

Nicola Dainese; Matteo Merler; Minttu Alakuijala; Pekka Marttinen

TL;DR

The paper proposes Code World Models (CWMs), where large language models generate executable Python code to model environment dynamics for model-based RL. It introduces GIF-MCTS, a Monte Carlo Tree Search–guided loop that generates, improves, and fixes code using unit-test feedback from environment trajectories, enabling robust offline synthesis of CWMs. A new Code World Models Benchmark (CWMB) with 18 environments demonstrates that GIF-MCTS achieves state-of-the-art performance on code synthesis and planning benchmarks, delivering substantial gains in planning efficiency (orders of magnitude faster than LLM-only planning) and improved sample efficiency. The results highlight the potential of combining LLMs with structured code generation for fast, interpretable planning, while acknowledging limitations related to determinism, data quality, and generalization to more complex or stochastic environments.

Abstract

In this work we consider Code World Models, world models generated by a Large Language Model (LLM) in the form of Python code for model-based Reinforcement Learning (RL). Calling code instead of LLMs for planning has potential to be more precise, reliable, interpretable, and extremely efficient. However, writing appropriate Code World Models requires the ability to understand complex instructions, to generate exact code with non-trivial logic and to self-debug a long program with feedback from unit tests and environment trajectories. To address these challenges, we propose Generate, Improve and Fix with Monte Carlo Tree Search (GIF-MCTS), a new code generation strategy for LLMs. To test our approach in an offline RL setting, we introduce the Code World Models Benchmark (CWMB), a suite of program synthesis and planning tasks comprised of 18 diverse RL environments paired with corresponding textual descriptions and curated trajectories. GIF-MCTS surpasses all baselines on the CWMB and two other benchmarks, and we show that the Code World Models synthesized with it can be successfully used for planning, resulting in model-based RL agents with greatly improved sample efficiency and inference speed.
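
To make the idea concrete, the following is a minimal sketch of what a Code World Model and its trajectory-based check might look like. The toy 1-D environment, the `toy_cwm_step` signature, the `transition_accuracy` helper, and the recorded transitions are all illustrative assumptions and not the paper's actual interface or data. Counting a crash as a wrong prediction is also an assumption made here for simplicity.

```python
from typing import Callable, List, Tuple

# A transition is (state, action, next_state, reward, done).
Transition = Tuple[int, int, int, float, bool]


def toy_cwm_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Hypothetical CWM for a toy 1-D walk: action 1 moves right, 0 moves left.

    The episode ends with reward 1.0 when the state reaches 3.
    """
    next_state = state + (1 if action == 1 else -1)
    done = next_state == 3
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def transition_accuracy(
    step_fn: Callable[[int, int], Tuple[int, float, bool]],
    transitions: List[Transition],
) -> float:
    """Fraction of recorded transitions that step_fn reproduces exactly."""
    if not transitions:
        return 0.0
    correct = 0
    for state, action, next_state, reward, done in transitions:
        try:
            prediction = step_fn(state, action)
        except Exception:
            # Assumption: a crash counts as a wrong prediction.
            continue
        if prediction == (next_state, reward, done):
            correct += 1
    return correct / len(transitions)


# Made-up transitions standing in for trajectories from the true environment.
recorded = [
    (0, 1, 1, 0.0, False),
    (1, 1, 2, 0.0, False),
    (2, 1, 3, 1.0, True),
    (1, 0, 0, 0.0, False),
]
accuracy = transition_accuracy(toy_cwm_step, recorded)
```

In this sketch, the accuracy value plays the role of the unit-test feedback: a candidate that reproduces every recorded transition scores 1.0, and anything lower signals that the code needs further refinement.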

Paper Structure (40 sections, 6 equations, 15 figures, 14 tables)

Introduction
Related Work
Code World Models
GIF-MCTS
GIF-MCTS Actions
Experiments
Baselines
APPS
Code World Models Benchmark
Read to Fight Monsters
Discussion
Limitations
Conclusion
Broader Impact
Additional GIF-MCTS implementation details
...and 25 more sections

Figures (15)

Figure 1: Overview of the Code World Models (CWM) framework. Given the description of an environment and a task, we use an LLM guided by the GIF-MCTS method to iteratively generate and refine a candidate CWM. The candidate's correctness is evaluated by checking if it correctly predicts a set of trajectories collected from the true environment. If the model cannot fully predict all transitions, the fraction of correct predictions and other information are given as feedback to the LLM and the cycle repeats. After matching all transitions or having used up a computational budget, the best CWM is returned and used to solve the task via model-based planning.
Figure 2: Example of a GIF-MCTS tree for generating a CWM. Starting from the root of the tree, every action taken corresponds to 1) prompting the LLM to either generate, improve or fix a CWM, 2) parsing the LLM completion, and 3) evaluating the CWM's correctness using the available environment trajectories as unit tests (presented as a percentage inside the nodes). On buggy nodes, we allow only fix actions for up to $f$ sequential attempts and replace the actual value with a temporary one, represented in red. In healthy nodes we allow only generate and improve actions. All action prompts are exemplified on the right. The total number of fix attempts $f$ is a model hyperparameter, set to three in this figure and for our method.
Figure 3: Prompt on the APPS benchmark for the generate action.
Figure 4: Prompt on the APPS benchmark for the improve action.
Figure 5: Prompt on the APPS benchmark for the fix action.
...and 10 more figures
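
The action rule described in the Figure 2 caption can be sketched as a small gating function. This is only an illustration under stated assumptions: the `Node` class, its field names, and `allowed_actions` are hypothetical, and the choice to return no actions once the fix budget is exhausted is an assumption, since the caption does not say what happens at that point.

```python
from dataclasses import dataclass
from typing import List

MAX_FIX_ATTEMPTS = 3  # the hyperparameter f from the Figure 2 caption


@dataclass
class Node:
    """Hypothetical tree node; field names are illustrative."""
    is_buggy: bool
    sequential_fix_attempts: int = 0


def allowed_actions(node: Node, max_fix_attempts: int = MAX_FIX_ATTEMPTS) -> List[str]:
    """Return the action types that may be expanded from this node."""
    if node.is_buggy:
        # Buggy nodes admit only fix actions, up to f sequential attempts.
        if node.sequential_fix_attempts < max_fix_attempts:
            return ["fix"]
        # Assumption: once the fix budget is spent, the node is not expanded.
        return []
    # Healthy nodes admit only generate and improve actions.
    return ["generate", "improve"]
```

For example, a healthy node yields `["generate", "improve"]`, while a buggy node that has already been through three sequential fix attempts yields an empty list under the assumption above.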
