Inferring Events from Time Series using Language Models

Mingtian Tan; Mike A. Merrill; Zack Gottesman; Tim Althoff; David Evans; Tom Hartvigsen

TL;DR

The paper investigates whether Large Language Models (LLMs) can infer natural-language events from time-series data. It introduces a benchmark that pairs real-valued win-probability time series with event descriptions from NBA and NFL games. It evaluates 18 LLMs, finds that reasoning-focused and open-weight models achieve strong performance (e.g., o1 leads on NBA with 83% accuracy), and demonstrates that post-training via distillation and GRPO can lift small models substantially toward the top tier. Chain-of-Thought prompting further improves reasoning in many cases, though it can slightly increase the rate of invalid outputs. The study also analyzes the impact of contextual cues, data ablations, and time-series similarity, and validates generalization to open-domain datasets such as Time-MMD and CryptoTrade, including scenarios where numerical information is masked. The work provides a reproducible framework and data, highlighting avenues for improving event inference from time series and informing future development of multimodal reasoning systems.

Abstract

Time series data measure how environments change over time and drive decision-making in critical domains like finance and healthcare. A common goal in analyzing time series data is to understand the underlying events that cause the observed variations. We conduct the first study of whether Large Language Models (LLMs) can infer events described in natural language from time series data. We evaluate 18 LLMs on a task that requires matching event sequences with real-valued time series data, using a new benchmark we develop from sports data. Several current LLMs demonstrate promising abilities, with OpenAI's o1 performing best and DS-R1-distill-Qwen-32B outperforming proprietary models such as GPT-4o. From insights derived from analyzing reasoning failures, we also find clear avenues to improve performance. By applying post-training optimizations, i.e., distillation and self-improvement, we significantly enhance the performance of Qwen2.5 1.5B, achieving results second only to o1. All resources needed to reproduce our work are available: https://github.com/BennyTMT/GAMETime
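As a rough illustration of how a matching task of this kind might be scored, the sketch below computes accuracy and the share of invalid outputs from free-text model replies. It is not taken from the paper or its repository: the multiple-choice framing with labels A to D, the parsing rule, and the names `VALID_LABELS`, `parse_choice`, and `score` are all assumptions made for the example.

```python
from typing import Dict, Optional, Sequence

# Assumption: each question offers candidate event sequences labeled A-D.
VALID_LABELS = ("A", "B", "C", "D")


def parse_choice(reply: str) -> Optional[str]:
    """Return the last standalone option label in a reply, or None if absent."""
    # Strip simple punctuation so "B." and "(C)" are seen as bare labels.
    cleaned = reply.replace(".", " ").replace("(", " ").replace(")", " ")
    for token in reversed(cleaned.split()):
        if token in VALID_LABELS:
            return token
    return None  # counted as an invalid output


def score(replies: Sequence[str], gold: Sequence[str]) -> Dict[str, float]:
    """Compute accuracy and invalid-output rate over paired replies and answers."""
    if len(replies) != len(gold) or not gold:
        raise ValueError("replies and gold must be non-empty and equal in length")
    parsed = [parse_choice(r) for r in replies]
    correct = sum(p == g for p, g in zip(parsed, gold))
    invalid = sum(p is None for p in parsed)
    return {"accuracy": correct / len(gold), "invalid_rate": invalid / len(gold)}


if __name__ == "__main__":
    replies = ["The answer is B.", "I think (C)", "not sure"]
    gold = ["B", "A", "D"]
    print(score(replies, gold))
```

Invalid replies are counted as incorrect here, which keeps accuracy comparable across models that differ in how often they fail to commit to an option. The parsing rule is deliberately simple and would misread a reply whose last capitalized single letter is not the intended choice.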
