LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

Periklis Mantenoglou, Rishi Hazra, Pedro Zuidberg Dos Martires, Luc De Raedt

TL;DR

LexiCon introduces a natural-language-based benchmark for planning under temporal constraints, enabling principled evaluation of LLMs on constrained planning tasks expressed in NL. It combines a constrained problem generator, an NL translator, and an automated verifier to generate, translate, and validate NL-constrained planning problems across multiple domains with a PDDL3.0-style formalism. Across BabyAI, Blocksworld, Logistics, Sokoban, and AlfWorld, reasoning-enabled LLMs show robust performance on simple constraint sets but deteriorate as constraint complexity grows, highlighting current limits in algorithmic planning capabilities. The framework supports on-the-fly problem generation for real-time evaluation and points to future work in partially observable settings, broader constraint classes, and integration with reinforcement learning to broaden practical applicability.

Abstract

Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. In order to deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LexiCon -- a natural language-based (Lexi) constrained (Con) planning benchmark, consisting of a suite of environments, that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LexiCon is to take existing planning environments and impose temporal constraints on the states. These constrained problems are then translated into natural language and given to an LLM to solve. A key feature of LexiCon is its extensibility. That is, the set of supported environments can be extended with new (unconstrained) environment generators, for which temporal constraints are constructed automatically. This renders LexiCon future-proof: the hardness of the generated planning problems can be increased as the planning capabilities of LLMs improve. Our experiments reveal that the performance of state-of-the-art LLMs, including reasoning models like GPT-5, o3, and R1, deteriorates as the degree of constrainedness of the planning tasks increases.
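To make "temporal constraints on the states" concrete, the following is a minimal sketch of how a few PDDL3.0-style trajectory constraints could be checked against the sequence of states a plan visits. It is not LexiCon's verifier: the state representation (sets of ground atoms as strings), the helper names, and the atoms in the example trace are all assumptions made for illustration.

```python
from typing import Callable, FrozenSet, Sequence

# Assumption: a state is a set of ground atoms written as plain strings.
State = FrozenSet[str]
Cond = Callable[[State], bool]


def holds(atom: str) -> Cond:
    """Condition that is true in any state containing `atom`."""
    return lambda state: atom in state


def always(c: Cond, trace: Sequence[State]) -> bool:
    """`c` is true in every state of the trace."""
    return all(c(s) for s in trace)


def sometime(c: Cond, trace: Sequence[State]) -> bool:
    """`c` is true in at least one state of the trace."""
    return any(c(s) for s in trace)


def at_most_once(c: Cond, trace: Sequence[State]) -> bool:
    """`c` is true over at most one contiguous run of states."""
    runs, prev = 0, False
    for s in trace:
        cur = c(s)
        if cur and not prev:
            runs += 1
        prev = cur
    return runs <= 1


def sometime_before(c: Cond, d: Cond, trace: Sequence[State]) -> bool:
    """Whenever `c` is true, `d` was true in a strictly earlier state."""
    seen_d = False
    for s in trace:
        if c(s) and not seen_d:
            return False
        seen_d = seen_d or d(s)
    return True


def sometime_after(c: Cond, d: Cond, trace: Sequence[State]) -> bool:
    """Whenever `c` is true, `d` is true in that state or a later one."""
    pending = False
    for s in trace:
        if c(s):
            pending = True
        if d(s):
            pending = False
    return not pending


# Hypothetical trace: the states visited by some plan, initial state first.
trace = [
    frozenset({"agent_in_room1"}),
    frozenset({"agent_in_room1", "holding_key"}),
    frozenset({"agent_in_room2", "holding_key", "door_open"}),
    frozenset({"agent_in_room2", "door_open"}),
]

results = {
    "always not holding_bomb": always(lambda s: "holding_bomb" not in s, trace),
    "sometime door_open": sometime(holds("door_open"), trace),
    "at-most-once holding_key": at_most_once(holds("holding_key"), trace),
    "door_open needs holding_key before": sometime_before(
        holds("door_open"), holds("holding_key"), trace
    ),
    "holding_key is followed by door_open": sometime_after(
        holds("holding_key"), holds("door_open"), trace
    ),
}
```

In this sketch a plan respects a constraint set only if every check returns true for its trace, so adding constraints can only shrink the set of valid plans.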

Paper Structure

This paper contains 21 sections, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: Constrained problems on environments supported in LexiCon. From left to right: BabyAI [babyai], Blocksworld [DBLP:journals/ai/GuptaN92], Logistics [DBLP:journals/aim/McDermott00], Sokoban [DBLP:conf/nips/FengGS20], and AlfWorld [alfworld]. A constrained planning task is specified by an initial state, a goal, and a set of constraints to be respected.
  • Figure 2: Left: The initial state of the constrained planning problem in the running example. The red triangle represents the agent. Bottom: The goal and the constraints of the problem in PDDL3.0. Middle: Optimal plan for this problem. Right: Optimal plan for the corresponding unconstrained problem.
  • Figure 3: Left: The architecture of LexiCon. Solid arrows denote input/output data transfers. Dashed arrows denote optional input. Right: The translator of LexiCon. Dotted arrows express content extraction.
  • Figure 4: Top: Fragment of our natural language description of the constrained problem of the running example. Bottom: System role prompt.
  • Figure 5: Invalid plans suggested by LLMs for the constrained problem in the running example.
  • ...and 4 more figures