Language Model Planners do not Scale, but do Formalizers?

Owen Jiang; Cassie Huang; Ashish Sabharwal; Li Zhang

Language Model Planners do not Scale, but do Formalizers?

Owen Jiang, Cassie Huang, Ashish Sabharwal, Li Zhang

Abstract

Recent work shows overwhelming evidence that LLMs, even those trained to scale their reasoning trace, perform unsatisfactorily when solving planning problems too complex. Whether the same conclusion holds for LLM formalizers that generate solver-oriented programs remains unknown. We systematically show that LLM formalizers greatly out-scale LLM planners, some retaining perfect accuracy in the classic BlocksWorld domain with a huge state space of size up to $10^{165}$. While performance of smaller LLM formalizers degrades with problem complexity, we show that a divide-and-conquer formalizing technique can greatly improve its robustness. Finally, we introduce unraveling problems where one line of problem description realistically corresponds to exponentially many lines of formal language such as the Planning Domain Definition Language (PDDL), greatly challenging LLM formalizers. We tackle this challenge by introducing a new paradigm, namely LLM-as-higher-order-formalizer, where an LLM generates a program generator. This decouples token output from the combinatorial explosion of the underlying formalization and search space.

Language Model Planners do not Scale, but do Formalizers?

Abstract

. While performance of smaller LLM formalizers degrades with problem complexity, we show that a divide-and-conquer formalizing technique can greatly improve its robustness. Finally, we introduce unraveling problems where one line of problem description realistically corresponds to exponentially many lines of formal language such as the Planning Domain Definition Language (PDDL), greatly challenging LLM formalizers. We tackle this challenge by introducing a new paradigm, namely LLM-as-higher-order-formalizer, where an LLM generates a program generator. This decouples token output from the combinatorial explosion of the underlying formalization and search space.

Paper Structure (15 sections, 6 figures)

This paper contains 15 sections, 6 figures.

Introduction
Scaling of Planners vs. Formalizers
Benchmark
Models
Scaling Performance
Theoretical Consideration
Unraveling Problems
Conclusion
Limitation
Problem Size and Plan Length Relation
Dataset Construction
Experimental Details
Potential Risks
Personally Identifying Information
Descriptive Statistics

Figures (6)

Figure 1: Performance of state-of-the-art LLMs, both as planners and as formalizers, on a classical planning domains with increasing problem complexity quantified by entity space size. For the divide-and-conquer (D&C) technique, we omit Gemini 3 Flash due to its already perfect accuracy as a regular formalizer, and we omit Qwen3-32B due to its worse performance.
Figure 2: An illustration of the task formulation and pipelines. The LLM-as-formalizer is given a domain description, domain file, and a problem description. It produces a PDDL problem file which is fed to a programmatic planner to derive a plan, which is evaluated.
Figure 3: An illustration of the nature of combinatorial explosion in the BlocksWorld domain. Under different paradigms, LLMs generate tokens tantamount to the complexity of different spaces.
Figure 4: An illustration of the unraveling problems and the LLM-as-higher-order-formalizer methodology, which generates a program that algorithmically produces PDDL, before input into a solver to derive a plan.
Figure 5: Performance of state-of-the-art LLMs, as planners, formalizers, and H-O formalizers, on two classical planning domains with increasing problem complexity quantified by input size.
...and 1 more figures

Language Model Planners do not Scale, but do Formalizers?

Abstract

Language Model Planners do not Scale, but do Formalizers?

Authors

Abstract

Table of Contents

Figures (6)