Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Yao Mu, Hongyuan Zhang, Wenqi Shao, Ping Luo

TL;DR

Text2World is a benchmark for evaluating how well large language models generate symbolic world models from natural language. Evaluation is grounded in hundreds of PDDL domains and multi-criteria, execution-based metrics. The study finds that even advanced LLMs struggle with world modeling, and that large reasoning models trained with reinforcement learning perform best among the tested models. The work further reports that error correction, test-time scaling, fine-tuning, and agent training can improve performance, and that concrete descriptions ease the inference burden. The benchmark, analysis, and proposed strategies establish a foundation for advancing LLMs as practical world-model generators and offer a reusable resource for the community.

Abstract

Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on the Planning Domain Definition Language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.

Paper Structure

This paper contains 45 sections, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Overview of Text2World.
  • Figure 2: Left: Dataset construction process, including (a) Data Acquisition; (b) Data Filtering and Manual Selection; (c) Data Annotation and Quality Assurance. Right: Key statistics of Text2World. Tokens are counted with the GPT-2 tokenizer (OpenAI, 2019). The figure style follows M3CoT (Chen et al., 2024).
  • Figure 3: n-gram contamination rate of Text2World and prior works.
  • Figure 4: Top: The frequency of requirements distribution. Bottom: Word cloud of concepts in Text2World.
  • Figure 5: Left: The distribution of syntax error types during the progression of correction. Right: The distribution of semantic error types.
  • ...and 3 more figures