Evaluating System 1 vs. 2 Reasoning Approaches for Zero-Shot Time Series Forecasting: A Benchmark and Insights

Haoxin Liu, Zhiyuan Zhao, Shiduo Li, B. Aditya Prakash

TL;DR

The paper addresses zero-shot time-series forecasting by introducing ReC4TS, a comprehensive benchmark that evaluates reasoning strategies across unimodal and multimodal TSF in eight domains and four forecasting settings, using $MSE$ as the evaluation metric. It systematically contrasts direct System 1 reasoning, test-time enhancements (CoT, Self-Consistency, Self-Correction), and post-training System 2 strategies (notably GRPO-based DeepSeek-R1) on a suite of foundation models. Key findings show that self-consistency is the most reliable test-time strategy, System 2 approaches are generally less effective except in specific DeepSeek-R1 cases, and multimodal TSF benefits more from reasoning than unimodal TSF. The work contributes the TimeThinking reasoning-annotated TSF dataset, a test-time scaling law validated on foundation TSF models, and the open-source ReC4TS toolkit, collectively enabling ongoing research into reasoning-enabled zero-shot TSF.

Abstract

Reasoning ability is crucial for solving challenging tasks. With the advancement of foundation models, such as the emergence of large language models (LLMs), a wide range of reasoning strategies has been proposed, including test-time enhancements, such as Chain-of-Thought, and post-training optimizations, as used in DeepSeek-R1. While these reasoning strategies have demonstrated effectiveness across various challenging language or vision tasks, their applicability and impact on time-series forecasting (TSF), particularly the challenging zero-shot TSF, remain largely unexplored. In particular, it is unclear whether zero-shot TSF benefits from reasoning and, if so, what types of reasoning strategies are most effective. To bridge this gap, we propose ReC4TS, the first benchmark that systematically evaluates the effectiveness of popular reasoning strategies when applied to zero-shot TSF tasks. ReC4TS conducts comprehensive evaluations across datasets spanning eight domains, covering both unimodal and multimodal settings with short-term and long-term forecasting tasks. More importantly, ReC4TS provides key insights: (1) Self-consistency emerges as the most effective test-time reasoning strategy; (2) Group-relative policy optimization emerges as a more suitable approach for incentivizing reasoning ability during post-training; (3) Multimodal TSF benefits more from reasoning strategies than unimodal TSF. Beyond these insights, ReC4TS establishes two pioneering starting blocks to support future zero-shot TSF reasoning research: (1) A novel dataset, TimeThinking, containing forecasting samples annotated with reasoning trajectories from multiple advanced LLMs, and (2) A new and simple test-time scaling law, validated on foundation TSF models and enabled by the self-consistency reasoning strategy. All data and code are publicly accessible at: https://github.com/AdityaLab/OpenTimeR
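To make the self-consistency idea concrete for numeric forecasts, the following is a minimal sketch rather than the ReC4TS implementation. It assumes that self-consistency means drawing several candidate forecasts from a stochastic model and aggregating them per time step; the median aggregation, the function names (`self_consistent_forecast`, `mse`, `make_toy_sampler`), and the Gaussian toy sampler are all hypothetical choices made for illustration.

```python
import random
import statistics
from typing import Callable, List, Sequence


def self_consistent_forecast(
    sample_path: Callable[[], Sequence[float]], num_paths: int
) -> List[float]:
    """Draw `num_paths` candidate forecasts and take the per-step median.

    The median is an assumed aggregation rule, not one confirmed by the paper.
    """
    if num_paths < 1:
        raise ValueError("num_paths must be at least 1")
    paths = [list(sample_path()) for _ in range(num_paths)]
    horizon = len(paths[0])
    if any(len(path) != horizon for path in paths):
        raise ValueError("all sampled paths must share the same horizon")
    # zip(*paths) regroups the values by time step across all sampled paths.
    return [statistics.median(step_values) for step_values in zip(*paths)]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a forecast and the ground truth."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def make_toy_sampler(
    truth: Sequence[float], noise_std: float, seed: int
) -> Callable[[], List[float]]:
    """Hypothetical stand-in for a stochastic forecaster: truth plus noise."""
    rng = random.Random(seed)

    def sample() -> List[float]:
        return [t + rng.gauss(0.0, noise_std) for t in truth]

    return sample


if __name__ == "__main__":
    truth = [1.0, 2.0, 3.0, 4.0]
    sampler = make_toy_sampler(truth, noise_std=1.0, seed=0)
    for k in (1, 5, 25):
        forecast = self_consistent_forecast(sampler, k)
        print(f"paths={k:2d}  mse={mse(forecast, truth):.4f}")
```

In practice `sample_path` would wrap one stochastic call to an LLM or a foundation time-series model; the toy sampler only exists so that the sketch runs without any external dependency.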

Paper Structure

This paper contains 38 sections, 14 figures, 9 tables.

Figures (14)

  • Figure 1: The reasoning strategies included in the proposed ReC4TS benchmark. ReC4TS systematically includes three mainstream approaches: the direct System 1, i.e., directly using generative models such as GPT-4o for reasoning; the test-time-enhanced System 1, including Chain-of-Thought, Self-Consistency, and Self-Correction; the post-training-empowered System 2, which enables built-in reasoning capabilities through reinforcement learning, such as DeepSeek-R1 (Guo et al., 2025).
  • Figure 2: The average win rate of reasoning strategies compared to corresponding direct System 1 across all datasets and settings. We observe the consistent and significant effectiveness of self-consistency, as well as the unique effectiveness of DeepSeek-R1 among System 2 strategies.
  • Figure 3: Verified test-time scaling law on foundation time-series models inspired by our insights. MSE and MAE are normalized based on the performance of one sampled path. The performance of Chronos and Moirai continuously improves as the number of sampled reasoning paths in the self-consistency reasoning strategy increases.
  • Figure 4: Visualization of time-series data.
  • Figure 5: Prompt used for multimodal time-series forecasting.
  • ...and 9 more figures
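
One plausible reading of the normalization mentioned in the Figure 3 caption, stated here as an assumption rather than the paper's definition, is to divide the error obtained with $k$ sampled paths by the error obtained with a single path. With $\hat{y}_t^{(k)}$ the aggregated forecast from $k$ paths, $y_t$ the ground truth, and $H$ the forecast horizon (all symbols introduced here for illustration):

$$
\mathrm{MSE}(k) = \frac{1}{H} \sum_{t=1}^{H} \left( \hat{y}_t^{(k)} - y_t \right)^2,
\qquad
\widetilde{\mathrm{MSE}}(k) = \frac{\mathrm{MSE}(k)}{\mathrm{MSE}(1)} .
$$

Under this reading, $\widetilde{\mathrm{MSE}}(1) = 1$ by construction, and values below 1 indicate an improvement over a single sampled path. The same ratio could be formed for MAE.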