Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance

Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito

TL;DR

TempoBench delivers a formally grounded, verifiable benchmark for temporal reasoning in LLMs by coupling two diagnostic tasks—Temporal Trace Evaluation and Temporal Causality Evaluation—with a parameterizable difficulty space. Built on reactive synthesis and HOA formalisms, it enables precise measurement of how structural factors like states and transitions affect reasoning and credit assignment, beyond aggregate accuracy. The study shows distinct performance gaps as problem complexity grows and demonstrates strong, dataset-wide correlations between benchmark features and model performance, underscoring TempoBench as a diagnostic tool rather than a mere leaderboard. The framework holds practical significance for deploying more reliable LLM agents in temporally-structured, real-world tasks and for guiding targeted training and evaluation of temporal reasoning capabilities.

Abstract

Large Language Models (LLMs) increasingly excel at, and outpace human performance on, many tasks. However, to improve LLM reasoning, researchers rely either on ad-hoc generated datasets or on formal mathematical proof systems such as the Lean proof assistant. Ad-hoc generated datasets can capture the decision chains of real-world reasoning processes, but they may encode inadvertent bias in the space of reasoning they cover, and they cannot be formally verified. Systems like Lean, on the other hand, guarantee verifiability but are not well suited to capturing agentic tasks built on decision chains. This creates a gap both in performance on applications such as business agents or code assistants and in the usefulness of LLM reasoning benchmarks, which fall short in either reasoning structure or real-world alignment. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze how LLMs perform reasoning. TempoBench uses two evaluation benchmarks to break down reasoning ability. First, temporal trace evaluation (TTE) tests the ability of an LLM to understand and simulate the execution of a given multi-step reasoning system. Second, temporal causal evaluation (TCE) tests an LLM's ability to perform multi-step causal reasoning and to distill cause-and-effect relations from complex systems. We find that models score 65.6% on TCE-normal and 7.5% on TCE-hard. This shows that state-of-the-art LLMs understand the TCE task but perform poorly as system complexity increases. Our code is available at https://github.com/nik-hz/tempobench.
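
To make the trace-evaluation idea concrete, the following is a minimal sketch of what "simulating the execution of a multi-step system" can look like. It is not TempoBench's data format or scoring code: the toy machine `TOY_MACHINE`, its state and signal names, and the helper `first_inconsistent_step` are all hypothetical and chosen only for illustration. The sketch steps a small finite-state machine through a sequence of (input, output) pairs and reports the first step at which the claimed output disagrees with the machine.

```python
from typing import Dict, List, Tuple

# Hypothetical toy machine (not from TempoBench):
# (state, input) -> (next_state, output)
Transition = Dict[Tuple[str, str], Tuple[str, str]]

TOY_MACHINE: Transition = {
    ("idle", "req"): ("busy", "grant"),
    ("idle", "none"): ("idle", "quiet"),
    ("busy", "req"): ("busy", "wait"),
    ("busy", "none"): ("idle", "quiet"),
}


def first_inconsistent_step(
    machine: Transition, start: str, trace: List[Tuple[str, str]]
) -> int:
    """Return the index of the first step whose (input, output) pair the
    machine cannot produce, or -1 if the whole trace is consistent."""
    state = start
    for i, (inp, out) in enumerate(trace):
        key = (state, inp)
        if key not in machine:
            return i  # no transition defined for this input
        next_state, expected = machine[key]
        if out != expected:
            return i  # claimed output disagrees with the machine
        state = next_state
    return -1


good = [("req", "grant"), ("req", "wait"), ("none", "quiet")]
bad = [("req", "grant"), ("req", "grant")]
print(first_inconsistent_step(TOY_MACHINE, "idle", good))
print(first_inconsistent_step(TOY_MACHINE, "idle", bad))
```

Because the machine is explicit, every answer can be checked mechanically. That is the property a formally grounded benchmark relies on, as opposed to ad-hoc datasets whose labels cannot be verified.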

Paper Structure

This paper contains 29 sections, 8 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Overview of the TempoBench framework. TempoBench includes 5 key features for modeling temporal problem difficulty and uses them to conduct rigorous statistical analysis of reasoning performance. TempoBench consists of two tasks: Temporal Trace Evaluation (TTE) and Temporal Causality Evaluation (TCE)
  • Figure 2: Sample knowledge graph showcasing relationship inference. In this case, asked to determine who has been to the moon, the LLM is highly likely to have prior knowledge of this fact, showing a deficiency of temporal benchmarks.
  • Figure 3: Sample visualization of a TempoBench problem. This example shows a trace through a system and a causal explanation of what caused out_0 at step 4. A correct solution on this benchmark identifies the causal effects of XXXX out_0. Color legend: light pink for negative constraints (-1), white for neutral (0), light blue for positive constraints (+1).
  • Figure 4: Pipeline flowchart for data generation in TempoBench. This flowchart illustrates the creation of a formal controller and then the extraction of the key data needed to solve the two TempoBench tasks, TTE and TCE. A more detailed pipeline visualization is provided in the Appendix.
  • Figure 5: Pipeline flowchart for the evaluation harness that we use to score reasoning model performance
  • ...and 13 more figures