Eliciting Chain-of-Thought Reasoning for Time Series Analysis using Reinforcement Learning

Felix Parker, Nimeesha Chan, Chi Zhang, Kimia Ghobadi

TL;DR

COUNTS introduces an end-to-end framework that trains LLMs to perform explicit chain-of-thought reasoning on numerical time-series tasks by combining high-fidelity RVQ-VAE time-series tokenization, SFT-driven adaptation, and reinforcement learning with verifiable rewards via GRPO. This setup enables an LLM to interleave textual and time-series tokens, generate CoT steps, and optimize for task correctness across forecasting, classification, and medical signal interpretation. Experiments on ECG-QA, contextual forecasting, and UCR classification demonstrate meaningful improvements when employing RL-based CoT compared to supervised training, with substantial gains on reasoning-intensive tasks. These results position time-series analysis as a viable domain for RL-based reasoning and suggest directions for unified rewards, efficiency improvements, and multivariate extensions.

Abstract

Complex numerical time series analysis often demands multi-step reasoning capabilities beyond current models' reach. Tasks like medical diagnosis and weather forecasting require sequential reasoning processes -- including counterfactual analysis, logical deduction, knowledge application, and multi-modal contextual integration -- that existing time series models cannot explicitly perform. While recent research has shown large language models (LLMs) can achieve sophisticated Chain-of-Thought (CoT) reasoning through reinforcement learning (RL), these advances have primarily focused on mathematical and coding domains, with LLMs still demonstrating poor performance on time series tasks. We introduce Chain Of thought for Understanding Numerical Time Series (COUNTS), the first framework that trains LLMs to perform CoT reasoning across diverse time series tasks using RL with verifiable rewards. Our approach employs a Residual Vector-Quantized VAE to create high-fidelity discrete tokens that seamlessly integrate into a pre-trained LLM's vocabulary. COUNTS undergoes a two-stage training process: first, supervised fine-tuning on time series analysis tasks to master our novel representations, followed by Group Relative Policy Optimization training on verifiable problems using prompting strategies that encourage explicit reasoning steps before producing final answers. Our experiments demonstrate that this RL-driven approach with intermediate CoT reasoning significantly enhances LLM performance across various time series analysis tasks, opening new possibilities for complex temporal data reasoning.
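The residual vector quantization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes three quantizer levels and one separate scale value per patch, following the description in the Figure 1 caption. Everything else is an assumption made for illustration: the max-abs scaling rule, the random codebooks, and the function names are hypothetical, the MLP encoder that precedes the quantizer is omitted, and the scale is left as a raw number rather than mapped to a discrete token. This is not the authors' implementation.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level encodes the residual left so far."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def tokenize_patch(patch, codebooks):
    """Return (scale, indices): one scale value plus one index per RVQ level."""
    # Assumed scaling rule: divide by the largest magnitude (1.0 for an all-zero patch).
    scale = max(abs(x) for x in patch) or 1.0
    normalized = [x / scale for x in patch]
    indices, _ = rvq_encode(normalized, codebooks)
    return scale, indices

# Toy usage with random (untrained) codebooks: 3 levels, 8 entries each, dimension 4.
random.seed(0)
demo_codebooks = [
    [[random.gauss(0.0, spread) for _ in range(4)] for _ in range(8)]
    for spread in (1.0, 0.5, 0.25)
]
demo_scale, demo_tokens = tokenize_patch([0.2, 1.5, -0.7, 3.0], demo_codebooks)
print(demo_scale, demo_tokens)
```

In a trained model the codebooks are learned, so each additional level tends to shrink the reconstruction error. With the random codebooks used here, only the bookkeeping (one index per level, residual passed forward) is meaningful.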

Paper Structure

This paper contains 21 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The figure shows the tokenization process for time series patches, with Patch 2 highlighted as an example. Each input patch is processed through two parallel pathways: a scaling operation that generates a single scale token capturing magnitude information, and an MLP followed by a residual vector quantizer that produces three time series (TS) tokens encoding temporal patterns. This dual-pathway approach results in four tokens per patch, enabling comprehensive representation of both amplitude and temporal characteristics.
  • Figure 2: An LLM generates multiple sampled responses to an input prompt asking for ECG time series interpretation. Each response is evaluated by reward functions that assess format compliance (proper use of tags) and diagnostic correctness, with correct components highlighted in green. The resulting advantage scores, shown on a color gradient (a = -2 to a = 2, red to green), guide policy gradient updates.
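
The reward-and-advantage step described in the Figure 2 caption can be sketched as follows. The group-relative normalization (each reward minus the group mean, divided by the group standard deviation) is the standard GRPO formulation. The rest is assumed for illustration only: the `<think>`/`<answer>` tag names, the 0.5 format weight, the exact-match correctness check, the use of the population standard deviation, and the toy responses are all hypothetical and not taken from the paper.

```python
import re
import statistics

# Hypothetical tag layout: reasoning in <think>, final answer in <answer>.
RESPONSE_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def reward(response, gold):
    """Format reward plus correctness reward for a single sampled response."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0  # malformed output earns nothing
    format_reward = 0.5  # assumed weight for proper tag use
    is_correct = match.group(1).strip().lower() == gold.strip().lower()
    return format_reward + (1.0 if is_correct else 0.0)

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up responses to a single yes/no question whose reference answer is "yes".
samples = [
    "<think>The rhythm looks irregular.</think><answer>yes</answer>",
    "<think>The intervals look even.</think><answer>no</answer>",
    "yes",
]
sample_rewards = [reward(s, "yes") for s in samples]
sample_advantages = group_advantages(sample_rewards)
print(sample_rewards, sample_advantages)
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.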