TCP: a Benchmark for Temporal Constraint-Based Planning

Zifeng Ding, Sikuan Yan, Zhangdie Yuan, Xianglong Hu, Fangru Lin, Andreas Vlachos

TL;DR

TCP introduces a comprehensive benchmark for temporal constraint-based planning in dialogue-form problems, addressing multi-constraint scheduling across diverse domains. The authors employ a scalable 600-instance generation pipeline that couples abstract prototypes with domain-rich scenarios and rigorous quality checks, while anonymizing content to reduce memorization effects. Experimental results show that even state-of-the-art reasoning LLMs struggle with asynchronous constraints and time-zone reasoning, although they outperform standard LLMs prompted with CoT. Longer, more structured reasoning correlates with higher accuracy. Overall, TCP reveals key weaknesses in current temporal planning capabilities and provides a pragmatic, open-source platform to drive future improvements in LLM-based planning under temporal constraints.

Abstract

Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. To address this gap, we introduce the Temporal Constraint-based Planning (TCP) benchmark that jointly assesses both capabilities. Each instance in TCP features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed, and models must infer an optimal schedule that satisfies all constraints. To construct TCP, we generate abstract problem prototypes that are then paired with realistic scenarios from various domains and enriched into dialogues using an LLM. A human quality check is performed on a sampled subset to confirm the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the strongest models may struggle with TCP, highlighting its difficulty and revealing limitations in LLMs' temporal constraint-based planning abilities. We analyze underlying failure cases, open source our benchmark, and hope our findings can inspire future research.
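
To make the task concrete, the following is a minimal sketch of the kind of schedule inference a TCP instance asks for, reduced to precedence constraints only. It is not the authors' pipeline or evaluation code: the function name `earliest_schedule`, the task names, and the durations in hours are all hypothetical, and real TCP problems add further constraints (such as time zones) that this sketch ignores. It computes the earliest start time of each task and the overall completion time for two small dependency patterns, a fork (A → B, A → C) and a chain (A → B, B → C).

```python
from graphlib import TopologicalSorter


def earliest_schedule(durations, deps):
    """Return (start_times, makespan) under precedence constraints only.

    durations: dict mapping task -> duration (hypothetical unit: hours)
    deps: dict mapping task -> set of tasks that must finish before it starts
    """
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {task: deps.get(task, set()) for task in durations}
    start = {}
    for task in TopologicalSorter(graph).static_order():
        # A task starts as soon as all of its predecessors have finished.
        start[task] = max(
            (start[p] + durations[p] for p in graph[task]), default=0
        )
    makespan = max(start[t] + durations[t] for t in durations)
    return start, makespan


# Hypothetical durations; not taken from the benchmark.
durations = {"A": 3, "B": 2, "C": 4}
fork = {"B": {"A"}, "C": {"A"}}   # A -> B, A -> C
chain = {"B": {"A"}, "C": {"B"}}  # A -> B, B -> C

fork_start, fork_makespan = earliest_schedule(durations, fork)
chain_start, chain_makespan = earliest_schedule(durations, chain)
print(fork_start, fork_makespan)
print(chain_start, chain_makespan)
```

In the fork, B and C can run in parallel once A finishes, so the total time is shorter than in the chain, where every task waits for the previous one. An optimal schedule in this simplified setting is the one with the smallest completion time that violates no dependency.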

Paper Structure

This paper contains 36 sections, 19 figures, 12 tables.

Figures (19)

  • Figure 1: Example short (left) and long (right) problems in TCP. Some parts are abbreviated due to limited space. The red underlined text marks the temporal constraints. The problem on the left belongs to the domain of Business Administration and involves the dependency type A $\rightarrow$ B, A $\rightarrow$ C. The problem on the right falls within the domain of Communications/Media Studies and involves the dependency type A $\rightarrow$ B, B $\rightarrow$ C. See the benchmark construction section for details of the domains and dependency types in TCP, and the examples appendix for more example problems.
  • Figure 2: Overview of our benchmark generation pipeline.
  • Figure 3: Problem prototype examples. Left: prototype of the short problem in Fig. 1 (left). Right: prototype of the long problem in Fig. 1 (right). The red underlined text marks the temporal constraints.
  • Figure 4: Performance on different dependency types for short and long problems (left) and on different domains for reasoning LLMs and standard LLMs with CoT prompting (right). CS = Computer Science, BA = Business Administration, ME = Mechanical Engineering, BIO = Biology, ECO = Economics, PSY = Psychology, PS = Political Science, EE = Electrical Engineering, COMM = Communications/Media Studies, HS = Nursing/Health Sciences.
  • Figure 5: Distribution of output lengths (number of tokens) for correct and incorrect predictions. Longer outputs tend to lead to correct answers and shorter outputs tend to correlate with incorrect predictions.
  • ...and 14 more figures