TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks

Muyan Weng, Defu Cao, Wei Yang, Yashaswi Sharma, Yan Liu

TL;DR

TemporalBench addresses a core gap in time-series evaluation by moving beyond pure forecasting to diagnostic, context-aware temporal reasoning. It introduces a four-tier taxonomy (T1–T4) spanning historical understanding, context-free prediction, contextual reasoning, and event-conditioned forecasting across four real-world domains, with a unified transformation pipeline that injects or detects events and generates robust ground-truth labels. Through extensive experiments, the authors demonstrate that strong numerical forecasting does not reliably translate into accurate contextual or event-aware decisions, and that agent-based frameworks exhibit fragmented strengths with domain-dependent limitations. The benchmark, together with its public dataset and leaderboard, provides a principled platform for evaluating temporal competencies and guiding the design of robust, context-aware time-series agents.

Abstract

It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions. We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings. TemporalBench adopts a four-tier task taxonomy that examines historical structure interpretation, context-free forecasting, contextual temporal reasoning, and event-conditioned prediction across four real-world domains: retail, healthcare, energy, and physical systems. By controlling access to future targets and contextual information, the benchmark enables a diagnostic analysis of whether models can correctly interpret temporal patterns, align them with external context, and adapt predictions when conditions change. Extensive baseline experiments show that strong numerical forecasting accuracy does not reliably translate into robust contextual or event-aware temporal reasoning; instead, existing agent frameworks exhibit fragmented strengths and systematic failure modes that remain largely hidden under forecasting-only benchmarks. The TemporalBench dataset is publicly available at https://huggingface.co/datasets/Melady/TemporalBench, and we additionally provide a public leaderboard at https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard.
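One way to picture the "controlled access" idea in the abstract is as a per-tier view over a single instance. The sketch below is a minimal illustration, not the authors' pipeline: the names `Tier`, `TaskView`, and `build_view` are hypothetical, the sample context and event strings are made up, and the exact visibility rules (in particular, whether T1 sees context and whether T4 sees both context and event) are assumptions inferred from the tier names. Future targets are never placed in the view, mirroring the idea that they are held out as labels.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple


class Tier(Enum):
    # Tier descriptions follow the abstract's wording.
    T1 = "historical structure interpretation"
    T2 = "context-free forecasting"
    T3 = "contextual temporal reasoning"
    T4 = "event-conditioned prediction"


@dataclass(frozen=True)
class TaskView:
    """Inputs a model is allowed to see for one task (hypothetical schema)."""
    history: Tuple[float, ...]
    context: Optional[str] = None
    event: Optional[str] = None


def build_view(tier: Tier, history: Sequence[float],
               context: str, event: str) -> TaskView:
    """Return only the inputs the given tier is assumed to expose.

    Assumption: context is visible from T3 upward, and the explicit
    event description is visible only in T4.
    """
    return TaskView(
        history=tuple(history),
        context=context if tier in (Tier.T3, Tier.T4) else None,
        event=event if tier is Tier.T4 else None,
    )


if __name__ == "__main__":
    # Made-up values, for illustration only.
    for tier in Tier:
        view = build_view(tier, [10.0, 12.5, 11.0],
                          "holiday week", "promotion starts")
        print(tier.name, view)
```

Keeping the history identical across tiers and varying only the side information is what makes a comparison between tiers diagnostic: a change in accuracy can then be attributed to the added context or event rather than to different input data.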

Paper Structure

This paper contains 48 sections, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: Overview of the task generation pipeline in our benchmark, illustrating how raw time-series data are transformed into T1–T4 tasks through event generation or detection, prompt assembly, and format-aware task construction.
  • Figure 2: Radar plots showing the performance of different agents on the six T3 reasoning dimensions (C1–C6) under different base LLMs. Each subplot corresponds to a base model, and each curve represents an agent.
  • Figure 3: Distribution of error types across five agents using gpt-4o as the base LLM, aggregated over all datasets and tasks. Each pie chart corresponds to an agent and shows the proportion of different failure modes.
  • Figure 4: Effect of input time-series length on agent performance across four datasets. Each subplot corresponds to a dataset, and multi-choice accuracy is reported for different agents under varying input lengths. Lines with the same color denote the same agent, while darker to lighter shades represent T1–T4 tasks, respectively.
  • Figure 5: A full example instance (PSML) showing tasks T1–T4 in TemporalBench. All tiers share the same underlying time-series context, while T4 is additionally conditioned on an explicit event description.
  • ...and 3 more figures