Generating Symbolic World Models via Test-time Scaling of Large Language Models

Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge Lin, Weiyang Liu

TL;DR

The paper tackles the difficulty of using LLMs for complex planning by proposing to generate explicit PDDL-based world models from natural-language prompts, enabling principled planning with classical search. It introduces a test-time compute scaling framework that combines Best-of-N (BoN) sampling for diverse initializations with instance verbalized machine learning (iVML) for iterative refinement, all without fine-tuning. Across NL2Domain, Prob2Domain, and PDDL problem-generation benchmarks, BoN and iVML improve synthesis accuracy and planning reliability, achieving near-perfect performance in several settings and results competitive with oracle-based approaches. This approach offers a scalable, transparent path toward robust symbolic reasoning in LLM-driven planning, with implications for safety and verifiability in AI systems, while acknowledging limitations in semantic verification and idealized evaluation conditions.

Abstract

Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality, a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classical search algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domains, achieving over 50% success rate on two tasks (i.e., generating PDDL domains from natural language descriptions or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as a state abstraction, our method is able to outperform current state-of-the-art methods on almost all competition-level planning tasks.
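
To make concrete why a symbolic world model permits classical search, the following is a minimal sketch, not the paper's implementation. It assumes a hand-written, already-grounded STRIPS-style model of a two-block world in place of a generated PDDL domain; the `Action` class, the `astar_plan` function, and the fact strings are all hypothetical names introduced for illustration. The heuristic divides the number of unsatisfied goal facts by the largest add-effect size, which never overestimates under unit action costs.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Action:
    """A grounded STRIPS-style action (hypothetical stand-in for a PDDL action)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

    def applicable(self, state):
        return self.preconditions <= state

    def apply(self, state):
        return (state - self.del_effects) | self.add_effects


def astar_plan(initial, goal, actions):
    """A* over symbolic states with unit action costs; returns action names or None."""
    max_add = max((len(a.add_effects) for a in actions), default=1) or 1

    def h(state):
        # One action satisfies at most max_add goal facts, so this is admissible.
        return -(-len(goal - state) // max_add)  # ceiling division

    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(h(initial), next(tie), 0, initial, [])]
    best_cost = {initial: 0}
    while frontier:
        _, _, cost, state, plan = heapq.heappop(frontier)
        if goal <= state:
            return plan
        if cost > best_cost.get(state, float("inf")):
            continue  # stale queue entry
        for action in actions:
            if action.applicable(state):
                nxt = action.apply(state)
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    heapq.heappush(
                        frontier,
                        (new_cost + h(nxt), next(tie), new_cost, nxt, plan + [action.name]),
                    )
    return None


def fs(*facts):
    return frozenset(facts)


# Toy two-block world (illustrative only): block a starts on block b.
actions = [
    Action("unstack(a,b)", fs("on(a,b)", "clear(a)", "handempty"),
           fs("holding(a)", "clear(b)"), fs("on(a,b)", "clear(a)", "handempty")),
    Action("putdown(a)", fs("holding(a)"),
           fs("ontable(a)", "clear(a)", "handempty"), fs("holding(a)")),
    Action("pickup(b)", fs("ontable(b)", "clear(b)", "handempty"),
           fs("holding(b)"), fs("ontable(b)", "clear(b)", "handempty")),
    Action("stack(b,a)", fs("holding(b)", "clear(a)"),
           fs("on(b,a)", "clear(b)", "handempty"), fs("holding(b)", "clear(a)")),
]
initial = fs("on(a,b)", "ontable(b)", "clear(a)", "handempty")
goal = fs("on(b,a)")

plan = astar_plan(initial, goal, actions)
print(plan)  # ['unstack(a,b)', 'putdown(a)', 'pickup(b)', 'stack(b,a)']
```

Because every transition is checked against explicit preconditions, the search cannot return a plan that violates the model's rules; any remaining errors come from the model itself, which is why the quality of the generated domain matters.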

Paper Structure

This paper contains 28 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: An overview of the proposed method. Our test-time compute scaling approach consists of two main steps: (1) Best-of-N Sampling for PDDL Initialization (see the section on Best-of-N sampling): We start by running a parallel sampling process to generate multiple chain-of-thought responses that are composed of the formalized PDDL-based world model representation $\mathbf{D}_0^{(k)}$ and the natural language thought $\mathbf{T}_0^{(k)}$. (2) Closed-loop Iteration with iVML (see the section on iVML): We use iVML to iteratively improve the solutions. iVML incorporates: (a) an optimizer LLM $f_\mathrm{opt}$ that evaluates the solutions from the previous iteration, and (b) a learner LLM $f_\mathrm{update}$ that learns from the feedback and updates the PDDL-based world model $\mathbf{D}_i$. Here, $N$ represents the total number of candidate solutions generated, $k$ is the index of the top candidates retained for further optimization (with $K$ indicating the total number of such candidates), and the index $i$ is used to denote the iteration step within the optimization procedure. The optimal PDDL-based world model will be used in the systematic search engine for planning.
  • Figure 2: OpenAI-o1-preview plans for Termes: o1-preview frequently exhibits hallucination during the planning process. Specifically, in steps three and four, the LLM violates predefined rules when selecting and leveraging actions. Additionally, step four hallucinates the achievement of the goal, leading to incorrect or unrealistic outcomes. Even when using o1-preview itself to evaluate the hallucinated plan, it incorrectly identifies the plan as valid.
  • Figure 3: Left: The performance trend of BoN with an increasing number of samples. Right: The performance trend of iVML with an increasing number of training epochs (with the BoN sampling number set to $N = 8$).
  • Figure 4: The performance of iVML on NL2Domain tasks across different initialization settings.
  • Figure 5: The performance of iVML on Prob2Domain tasks across different initialization settings.
  • ...and 1 more figures
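
The two-step procedure in the Figure 1 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: the LLM calls are replaced by caller-supplied functions, the names `best_of_n`, `ivml_refine`, and `generate_world_model` are hypothetical, and both the scoring function and the rule for picking the final candidate are assumptions, since the caption does not specify them. The toy stubs below simply count unresolved `TODO` markers so that the loop runs without a model.

```python
from typing import Callable, List, Tuple

# A candidate pairs a PDDL domain string D with a natural-language thought T.
Candidate = Tuple[str, str]


def best_of_n(sample: Callable[[], Candidate],
              score: Callable[[Candidate], float],
              n: int, k: int) -> List[Candidate]:
    """Draw n candidates and keep the k highest-scoring ones (score is an assumption)."""
    candidates = [sample() for _ in range(n)]
    return sorted(candidates, key=score, reverse=True)[:k]


def ivml_refine(candidate: Candidate,
                f_opt: Callable[[str, str], str],
                f_update: Callable[[str, str, str], Candidate],
                steps: int) -> Candidate:
    """Closed loop: f_opt critiques the current solution, f_update revises it."""
    domain, thought = candidate
    for _ in range(steps):
        feedback = f_opt(domain, thought)
        domain, thought = f_update(domain, thought, feedback)
    return domain, thought


def generate_world_model(sample, score, f_opt, f_update,
                         n: int = 8, k: int = 2, steps: int = 3) -> Candidate:
    """BoN initialization followed by iterative refinement of the top-k candidates."""
    top_k = best_of_n(sample, score, n, k)
    refined = [ivml_refine(c, f_opt, f_update, steps) for c in top_k]
    # Assumption: return the refined candidate that scores highest.
    return max(refined, key=score)


# --- Toy stand-ins for the LLM calls (illustrative only) ---
drafts = iter([
    "(define (domain d) TODO TODO TODO)",
    "(define (domain d) TODO)",
    "(define (domain d) TODO TODO)",
    "(define (domain d) TODO TODO TODO TODO)",
])


def toy_sample() -> Candidate:
    return next(drafts), "draft thought"


def toy_score(candidate: Candidate) -> float:
    return -candidate[0].count("TODO")  # fewer unresolved markers is better


def toy_f_opt(domain: str, thought: str) -> str:
    return "fill in one TODO" if "TODO" in domain else "no issues found"


def toy_f_update(domain: str, thought: str, feedback: str) -> Candidate:
    return domain.replace(" TODO", "", 1), thought + " | " + feedback


best_domain, best_thought = generate_world_model(
    toy_sample, toy_score, toy_f_opt, toy_f_update, n=4, k=2, steps=2
)
print(best_domain)
```

In a real setting the four callables would wrap model queries, and the resulting domain would be handed to a planner; the sketch only shows how $N$, $K$, and the iteration index $i$ relate to one another in the control flow.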