Time Series Reasoning via Process-Verifiable Thinking Data Synthesis and Scheduling for Tailored LLM Reasoning

Jiahui Zhou; Dan Li; Boxin Li; Xiao Zhang; Erli Meng; Lin Li; Zhuomin Chen; Jian Lou; See-Kiong Ng

TL;DR

The paper introduces VeriTime, a framework for time series (TS) reasoning with LLMs that combines TS-tailored data synthesis (TSRgen) with an RL fine-tuning protocol that includes data scheduling. It constructs TSRBench, a TS-text multimodal dataset with process-verifiable CoT annotations across scenario-based and knowledge-based tasks, and uses a two-stage learning paradigm plus GRPO-based optimization with multi-objective rewards to improve both intermediate reasoning and final accuracy. Empirical results show that VeriTime delivers substantial performance gains across TS reasoning tasks, enables compact 3B–4B models to approach or surpass larger models, and gains efficiency from TS-tailored CoT and selective rollout. These findings suggest that combining process-level supervision with difficulty-aware data scheduling is a practical way to advance time series reasoning in LLMs, with implications for efficient, interpretable TS analysis in real-world applications.

Abstract

Time series is a pervasive data type across application domains, which makes solving diverse time series tasks in a well-reasoned way a long-standing goal. Recent advances in large language models (LLMs), especially the reasoning abilities unlocked through reinforcement learning (RL), have opened new opportunities for tackling tasks with long Chain-of-Thought (CoT) reasoning. However, leveraging LLM reasoning for time series remains in its infancy. It is hindered by the absence of carefully curated time series CoT data for training, by limited data efficiency caused by underexplored data scheduling, and by the lack of RL algorithms tailored to exploit such time series CoT data. In this paper, we introduce VeriTime, a framework that tailors LLMs for time series reasoning through data synthesis, data scheduling, and RL training. First, we propose a data synthesis pipeline that constructs a TS-text multimodal dataset with process-verifiable annotations. Second, we design a data scheduling mechanism that arranges training samples according to a principled hierarchy of difficulty and task taxonomy. Third, we develop a two-stage reinforcement fine-tuning procedure featuring fine-grained, multi-objective rewards that leverage verifiable process-level CoT data. Extensive experiments show that VeriTime substantially boosts LLM performance across diverse time series reasoning tasks. Notably, it enables compact 3B and 4B models to achieve reasoning capabilities on par with or exceeding those of larger proprietary LLMs.
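
As a rough illustration of how multi-objective rewards can feed a GRPO-style update, the sketch below combines three reward families (structural, process, and hard, i.e. final-answer, rewards, as named in the Figure 4 caption) into one scalar per rollout and then standardises the scalars within a rollout group. The weights, the function names, and the assumption that each component lies in [0, 1] are illustrative only and are not taken from the paper.

```python
from statistics import mean, pstdev

# Hypothetical weights: the source names the reward families, not their weighting.
WEIGHTS = {"structural": 0.2, "process": 0.3, "hard": 0.5}


def combined_reward(components, weights=WEIGHTS):
    """Weighted sum of per-objective rewards, each assumed to lie in [0, 1]."""
    return sum(weights[name] * components[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardise each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts sampled for the same prompt.
rollouts = [
    {"structural": 1.0, "process": 1.0, "hard": 1.0},  # well formed, correct
    {"structural": 1.0, "process": 0.5, "hard": 0.0},  # partly right reasoning
    {"structural": 0.0, "process": 0.0, "hard": 0.0},  # malformed output
]
rewards = [combined_reward(c) for c in rollouts]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centred within the group, a rollout is rewarded only for being better than its siblings for the same prompt, which is why partial credit for intermediate steps can still separate two rollouts that both miss the final answer.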

Paper Structure (35 sections, 19 equations, 11 figures, 13 tables, 1 algorithm)

Introduction
Methodology
TSRgen: Reasoning Data Synthesis Pipeline
Data and Task Selection
TS-tailored CoT Thinking Process
Automatic Reasoning Data Annotation and Verification
VeriTime: RL Fine-tuning with Data Scheduling
Multi-Objective Reward Design of VeriTime
Evaluation
Experimental Setup
Main Results (RQ1)
Analysis of TS-tailored Chain-of-Thought (RQ2)
Studies of Reward Composition (RQ3)
Analysis of Data Scheduling (RQ4)
Conclusion
...and 20 more sections

Figures (11)

Figure 1: The overall framework of the time series reasoning data generation pipeline TSRgen.
Figure 2: Overview of the proposed VeriTime. It consists of three stages: (1) Stage 1 leverages TSRBench to warm up a base LLM $\theta_0$ into $\theta_1$, which is subsequently used to perform difficulty stratification over all TSRBench tasks. (2) Stage 2 fine-tunes $\theta_1$ on samples with normal difficulty to obtain $\theta_2$, equipping the model with TS-oriented CoT thinking paradigms. (3) Stage 3 applies RL to optimize the accuracy of both final predictions and key intermediate reasoning steps, further enhancing reasoning quality and yielding $\theta_\text{final}$.
Figure 3: Step-wise Accuracy (%) comparison of Qwen2.5-3B-Instruct: warm-up SFT vs. stage 1 SFT + stage 2 RL on scenario-based reasoning tasks.
Figure 4: Ablation results on Qwen2.5-3B-Instruct. (a) Performance without structural, hard, or process rewards. (b) Performance without task comprehension, critical pattern, or alignment & verification rewards.
Figure 5: Performance comparison of training data selection methods on scenario-based reasoning tasks and knowledge-based reasoning tasks for the Qwen3-4B-Instruct model.
...and 6 more figures
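
The difficulty stratification mentioned in the Figure 2 caption can be pictured with a minimal sketch. It assumes, purely for illustration, that difficulty is measured by the warm-up model's pass rate over several sampled attempts per task and that fixed thresholds split tasks into easy, normal, and hard buckets. The criterion, the thresholds, and all names below are hypothetical and are not stated in the source.

```python
def pass_rate(is_correct_flags):
    """Fraction of sampled attempts that the warm-up model answered correctly."""
    return sum(is_correct_flags) / len(is_correct_flags)


def stratify(sample_pass_rates, easy_at=0.9, hard_at=0.1):
    """Bucket sample ids by pass rate. Both thresholds are hypothetical."""
    buckets = {"easy": [], "normal": [], "hard": []}
    for sample_id, rate in sample_pass_rates.items():
        if rate >= easy_at:
            buckets["easy"].append(sample_id)
        elif rate <= hard_at:
            buckets["hard"].append(sample_id)
        else:
            buckets["normal"].append(sample_id)
    return buckets


# Hypothetical correctness flags for eight attempts on each of three tasks.
attempts = {
    "q1": [True] * 8,
    "q2": [True, False, True, False, False, True, False, False],
    "q3": [False] * 8,
}
rates = {sample_id: pass_rate(flags) for sample_id, flags in attempts.items()}
buckets = stratify(rates)
```

Under this sketch, only the `normal` bucket would be passed to the Stage 2 fine-tuning step described in the caption. How the remaining buckets are scheduled is not specified here.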
