Inferring Events from Time Series using Language Models

Mingtian Tan, Mike A. Merrill, Zack Gottesman, Tim Althoff, David Evans, Tom Hartvigsen

TL;DR

The paper investigates whether Large Language Models can infer natural-language events from time-series data by introducing a benchmark that pairs real-valued win-probability time series with event descriptions from NBA and NFL games. It evaluates 18 LLMs, finds that reasoning-focused and open-weight models achieve strong performance (e.g., o1 leads on NBA data with 83% accuracy), and demonstrates that post-training via distillation and GRPO can substantially boost small models toward the top tier. Chain-of-Thought prompting improves reasoning in many cases, though it can slightly increase the number of invalid outputs. The study also analyzes the impact of contextual cues, data ablations, and time-series similarity, and validates generalization to open-domain datasets such as Time-MMD and CryptoTrade, including scenarios where numerical information is masked. The work provides a reproducible framework and data, highlighting avenues for improving event inference from time series and informing future development of multimodal reasoning systems.

Abstract

Time series data measure how environments change over time and drive decision-making in critical domains like finance and healthcare. A common goal in analyzing time series data is to understand the underlying events that cause the observed variations. We conduct the first study of whether Large Language Models (LLMs) can infer events described with natural language from time series data. We evaluate 18 LLMs on a task to match event sequences with real-valued time series data using a new benchmark we develop using sports data. Several current LLMs demonstrate promising abilities, with OpenAI's o1 performing the best and DS-R1-distill-Qwen-32B outperforming proprietary models such as GPT-4o. From insights derived from analyzing reasoning failures, we also find clear avenues to improve performance. By applying post-training optimizations, i.e., distillation and self-improvement, we significantly enhance the performance of Qwen2.5 1.5B, achieving results second only to o1. All resources needed to reproduce our work are available: https://github.com/BennyTMT/GAMETime

Paper Structure

This paper contains 39 sections, 3 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Illustration of time series event reasoning. The prompt provides (in text form, see details later in the paper) a time series of real-valued data (win probabilities) and corresponding natural language event descriptions. The model is prompted to select the most likely sequence of events for some segment of the time series data where no events are provided. (This example is taken from near the end of an NBA game, which is 48 minutes regulation time, between the Dallas Mavericks (Team A) and Los Angeles Lakers (Team B), 1 November 2019.)
  • Figure 2: The performance on NBA data indicates that open-weight models, such as Qwen2.5 72B, achieve results comparable to or even surpassing proprietary models like GPT-4o. In particular, reasoning-focused models such as DS-R1-distill-Qwen-32B and OpenAI's o1 significantly outperform others. Additionally, Chain-of-Thought (CoT) prompting further enhances reasoning capabilities. Similar trends are observed in the NFL data, with details provided in Figure 5 in the appendix. Note that open-weight models are presented in order of model size.
  • Figure 3: The performance of LLMs in distinguishing events corresponding to time series (win probabilities) with different levels of similarity. Time series similarity decreases as $x$ (i.e., time series distance) increases.
  • Figure 4: Examples of events and win probabilities in the NBA and NFL dataset. As the game progresses, ESPN provides descriptions of on-field events along with the corresponding win probabilities for each team at that moment. These probabilities can be considered a representation of the team's current state.
  • Figure 5: The performance of various language models on inferring NFL events from time series. Overall, this task is more challenging than NBA event reasoning.
  • ...and 10 more figures
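
The task described in the Figure 1 caption (a text prompt containing win probabilities with known events, followed by a segment whose events must be chosen from candidate sequences) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Question` structure, the prompt wording, the answer-letter format, the choice to count unparseable replies as wrong, and the example events are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Question:
    """One multiple-choice item (hypothetical layout, not the paper's schema)."""
    context: List[Tuple[float, str]]  # (Team A win probability, event description)
    segment: List[float]              # win probabilities whose events are withheld
    options: List[List[str]]          # candidate event sequences
    answer: int                       # index of the true sequence


def build_prompt(q: Question) -> str:
    """Serialize the time series and the candidate event sequences as text."""
    lines = ["Known win probabilities for Team A and the events behind them:"]
    for prob, event in q.context:
        lines.append(f"  {prob:.1%} | {event}")
    lines.append("Win probabilities with events withheld:")
    lines.append("  " + ", ".join(f"{p:.1%}" for p in q.segment))
    lines.append("Which event sequence most likely occurred?")
    for i, seq in enumerate(q.options):
        lines.append(f"  ({chr(ord('a') + i)}) " + " -> ".join(seq))
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_choice(reply: str, n_options: int) -> Optional[int]:
    """Map a reply such as '(B)' to an option index; None marks an invalid output."""
    cleaned = reply.strip().lower().strip("().")
    if len(cleaned) == 1 and 0 <= ord(cleaned) - ord("a") < n_options:
        return ord(cleaned) - ord("a")
    return None


def accuracy(questions: List[Question], replies: List[str]) -> float:
    """Fraction of correct replies; invalid outputs count as wrong (an assumption)."""
    correct = sum(
        parse_choice(r, len(q.options)) == q.answer
        for q, r in zip(questions, replies)
    )
    return correct / len(questions)


# Made-up example data, for illustration only.
example = Question(
    context=[(0.55, "Team A makes a three-pointer")],
    segment=[0.55, 0.48, 0.41],
    options=[
        ["Team A makes a layup", "Team A makes a free throw"],
        ["Team B makes a three-pointer", "Team A misses a jump shot"],
    ],
    answer=1,
)

if __name__ == "__main__":
    print(build_prompt(example))
```

In practice, the string returned by `build_prompt` would be sent to a model and the reply passed to `parse_choice`. Returning `None` for malformed replies keeps invalid outputs separate from wrong answers, which matters if they are to be reported separately.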