Eliciting Chain-of-Thought Reasoning for Time Series Analysis using Reinforcement Learning
Felix Parker, Nimeesha Chan, Chi Zhang, Kimia Ghobadi
TL;DR
COUNTS is an end-to-end framework that trains LLMs to perform explicit chain-of-thought (CoT) reasoning on numerical time-series tasks. It combines three components: high-fidelity time-series tokenization with a residual vector-quantized VAE (RVQ-VAE), supervised fine-tuning (SFT) to adapt the LLM to the new tokens, and reinforcement learning with verifiable rewards using Group Relative Policy Optimization (GRPO). With this setup, an LLM can interleave textual and time-series tokens, generate CoT steps, and be optimized for task correctness across forecasting, classification, and medical signal interpretation. Experiments on ECG-QA, contextual forecasting, and UCR classification show that RL-trained CoT improves over supervised training alone, with substantial gains on reasoning-intensive tasks. These results position time-series analysis as a viable domain for RL-based reasoning and suggest directions for unified rewards, efficiency improvements, and multivariate extensions.
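To make the tokenization idea concrete, the following is a minimal sketch of residual vector quantization, the discretization step inside an RVQ-VAE. It is not the authors' implementation: the learned encoder and decoder networks are omitted, the codebooks are hand-written toy values rather than learned ones, and the `to_token_ids` mapping (appending time-series codes after the text vocabulary at a hypothetical `vocab_offset`) is an assumption about how discrete codes could be exposed to an LLM.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x with a stack of codebooks; each level codes the previous residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen codeword so the next level refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct the vector by summing the selected codeword from each level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def to_token_ids(indices: Sequence[int], codebook_size: int, vocab_offset: int) -> List[int]:
    """Hypothetical mapping: give every (level, index) pair its own id after the text vocabulary."""
    return [vocab_offset + level * codebook_size + idx for level, idx in enumerate(indices)]


# Toy example: two levels, three one-dimensional codewords each (illustrative values).
toy_codebooks = [
    [[0.0], [1.0], [2.0]],     # coarse level
    [[-0.25], [0.0], [0.25]],  # fine level, applied to the residual
]
codes = rvq_encode([1.2], toy_codebooks)
approx = rvq_decode(codes, toy_codebooks)
token_ids = to_token_ids(codes, codebook_size=3, vocab_offset=32000)
```

Each additional level quantizes only what the earlier levels left unexplained, which is why stacking codebooks can raise reconstruction fidelity without a single very large codebook.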
Abstract
Complex numerical time series analysis often demands multi-step reasoning capabilities beyond current models' reach. Tasks like medical diagnosis and weather forecasting require sequential reasoning processes -- including counterfactual analysis, logical deduction, knowledge application, and multi-modal contextual integration -- that existing time series models cannot explicitly perform. While recent research has shown that large language models (LLMs) can achieve sophisticated Chain-of-Thought (CoT) reasoning through reinforcement learning (RL), these advances have primarily focused on mathematical and coding domains, and LLMs still perform poorly on time series tasks. We introduce Chain Of thought for Understanding Numerical Time Series (COUNTS), the first framework that trains LLMs to perform CoT reasoning across diverse time series tasks using RL with verifiable rewards. Our approach employs a Residual Vector-Quantized VAE to create high-fidelity discrete tokens that are added to a pre-trained LLM's vocabulary. COUNTS is trained in two stages: first, supervised fine-tuning on time series analysis tasks so that the model learns the new token representations; second, Group Relative Policy Optimization (GRPO) training on verifiable problems, using prompting strategies that encourage explicit reasoning steps before the final answer. Our experiments demonstrate that this RL-driven approach with intermediate CoT reasoning significantly enhances LLM performance across various time series analysis tasks, opening new possibilities for complex temporal data reasoning.
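The second training stage can be illustrated with a minimal sketch of the two ingredients it names: a verifiable reward and the group-relative advantage used in Group Relative Policy Optimization. This is a sketch under stated assumptions, not the authors' code. The `<answer>...</answer>` tag format, the exact-match check, and the function names are hypothetical, and the choice of population standard deviation with a small `eps` is one common convention rather than something the abstract specifies. The policy-gradient update itself is omitted.

```python
import re
from statistics import mean, pstdev
from typing import List, Sequence


def verifiable_reward(completion: str, target: str) -> float:
    """Return 1.0 if the text inside <answer>...</answer> matches the target, else 0.0.

    The tag format is an assumption made for this sketch.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == target.strip().lower() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four sampled completions for one classification question (made-up text).
completions = [
    "The rhythm is irregular with no clear P waves. <answer>atrial fibrillation</answer>",
    "The rate looks normal. <answer>normal sinus rhythm</answer>",
    "I am not sure about this one.",
    "R-R intervals vary widely. <answer>Atrial Fibrillation</answer>",
]
rewards = [verifiable_reward(c, "atrial fibrillation") for c in completions]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the group mean, completions that reach the correct answer are reinforced only to the extent that they outperform other samples for the same prompt, and a group in which every sample receives the same reward contributes zero advantage.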
