LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

Weizhi Tang; Kwabena Nuamah; Vaishak Belle

LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

Weizhi Tang, Kwabena Nuamah, Vaishak Belle

TL;DR

The paper tackles the challenge of evaluating temporal reasoning in large language models by introducing LTLBench, a pipeline that automatically synthesizes TR problems from random graphs and Linear Temporal Logic (LTL) formulas, with ground-truth labels produced by the NuSMV model checker. By generating 2000 TR challenges and evaluating 12 LLMs across 5 prompting strategies, the authors show that Few-Shot CoT generally yields the strongest performance, with GPT-5-Mini achieving up to 93.95% accuracy, while results reveal three recurring reasoning issues and sensitivity to problem complexity. The work demonstrates that increasing the number of operators $m$ or events $n$ increases problem difficulty and reduces robustness, highlighting gaps in current TR abilities and offering a scalable, formal framework for future benchmark design and model training. Limitations include a restricted operator set and a linear-time evaluation focus, with future directions proposed to incorporate branching-time logics such as CTL/CTL* to broaden the coverage of temporal reasoning tasks.

Abstract

Temporal Reasoning (TR) is a critical ability for LLMs to understand and reason over temporal information and relationships between events. To study the TR ability in LLMs, prior works provide different ways for evaluating various aspects of TR ability. In this work, we propose an alternative perspective for evaluating TR ability by leveraging Linear Temporal Logic (LTL), and develop a pipeline to automatically synthesize challenges for assessing the TR ability of LLMs. Based on this pipeline, we construct a dataset, namely \LTL, consisting of $2000$ TR challenges, and benchmark 12 LLMs across 5 different methods. Furthermore, we conduct additional experiments to investigate the impact of increasing the number of formula operators and events on both LLM performance and the complexity of TR problems. We also perform qualitative analyses of their reasoning processes and the effects of varying the number of events and formula operators, which reveal 3 main issues in their temporal reasoning processes and the unexpected performance changes observed as problem complexity increases. We expect this work to provide valuable insights into the TR ability of LLMs.

LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

TL;DR

or events

increases problem difficulty and reduces robustness, highlighting gaps in current TR abilities and offering a scalable, formal framework for future benchmark design and model training. Limitations include a restricted operator set and a linear-time evaluation focus, with future directions proposed to incorporate branching-time logics such as CTL/CTL* to broaden the coverage of temporal reasoning tasks.

Abstract

TR challenges, and benchmark 12 LLMs across 5 different methods. Furthermore, we conduct additional experiments to investigate the impact of increasing the number of formula operators and events on both LLM performance and the complexity of TR problems. We also perform qualitative analyses of their reasoning processes and the effects of varying the number of events and formula operators, which reveal 3 main issues in their temporal reasoning processes and the unexpected performance changes observed as problem complexity increases. We expect this work to provide valuable insights into the TR ability of LLMs.

Paper Structure (17 sections, 2 equations, 5 figures, 1 table)

This paper contains 17 sections, 2 equations, 5 figures, 1 table.

Introduction
Preliminary
Related Work
TR in LLMs
TR Benchmarks
TR Problem Generation Pipeline
Random Directed Graph Generation
LTL Formula Generation
NuSMV Code Generation
Natural Language Generation
Experiment Settings
Results and Analyses
Evaluation with LTLBench
Impact of Increasing $m$
Impact of increasing $n$
...and 2 more sections

Figures (5)

Figure 1: The overview of the TR problem generation pipeline.
Figure 2: An example of a generated random directed graph.
Figure 3: LLMs Performance with Different Methods.
Figure 4: Accuracy for DeepSeek-Reasoner, GPT-5-Mini, and Qwen-Turbo equipped with each method as the number of formula operators increases.
Figure 5: Accuracy for DeepSeek-Reasoner, GPT-5-Mini, and Qwen-Turbo equipped with each method as the number of events increases.

LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

TL;DR

Abstract

LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

Authors

TL;DR

Abstract

Table of Contents

Figures (5)