A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization

Haoxin Liu, Chenghao Liu, B. Aditya Prakash

TL;DR

This work addresses the challenge of enabling and evaluating large language models (LLMs) for time-series reasoning (TsR). It introduces TimerBed, a hierarchical testbed that probes TsR across stratified reasoning patterns, real-world tasks, and multiple LLMs with varied reasoning strategies. The evaluation reveals systematic weaknesses in zero-shot and few-shot settings, which the authors attribute to modeling the data directly as numbers. To overcome these limitations, the authors propose VL-Time, a prompt-based framework that combines visualization-based data modeling with language-guided reasoning in a plan-then-solve workflow. VL-Time achieves about 140% average improvement, and up to 433% in few-shot settings, at about 1% of the token cost of numerical modeling. The study shows that visualization and targeted prompting are effective for unlocking the TsR capabilities of multimodal LLMs, suggesting a practical path for integrating LLMs into time-series analysis and for designing future visual-centric approaches. The work has practical implications for deploying efficient, interpretable TsR in domains that require rapid analysis of long time series.

Abstract

Large language models (LLMs), with demonstrated reasoning abilities across multiple domains, are largely underexplored for time-series reasoning (TsR), which is ubiquitous in the real world. In this work, we propose TimerBed, the first comprehensive testbed for evaluating LLMs' TsR performance. Specifically, TimerBed includes stratified reasoning patterns with real-world tasks, comprehensive combinations of LLMs and reasoning strategies, and various supervised models as comparison anchors. We perform extensive experiments with TimerBed, test multiple current beliefs, and verify the initial failures of LLMs in TsR, evidenced by the ineffectiveness of zero-shot (ZST) reasoning and the performance degradation of few-shot in-context learning (ICL). Further, we identify one possible root cause: the numerical modeling of data. To address this, we propose a prompt-based solution, VL-Time, which uses visualization-modeled data and language-guided reasoning. Experimental results demonstrate that VL-Time enables multimodal LLMs to be non-trivial ZST and powerful ICL reasoners for time series, achieving about 140% average performance improvement and 99% average token cost reduction.
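
The plan-then-solve workflow can be illustrated with a minimal sketch. This is not the authors' implementation: the `LLMFn` interface, the `PlanThenSolve` class, and the prompt wording are all hypothetical, chosen only to show the control flow in which a plan is produced once per task and then reused for every sample, with the series passed to the model as a plot image rather than as numbers.

```python
from typing import Callable, Dict, List

# Hypothetical LLM interface (an assumption, not an API from the paper):
# takes a text prompt plus a list of image paths and returns the reply text.
LLMFn = Callable[[str, List[str]], str]


class PlanThenSolve:
    """Cache one plan per task, then reuse it for every sample of that task."""

    def __init__(self, llm: LLMFn) -> None:
        self.llm = llm
        self._plans: Dict[str, str] = {}

    def plan(self, task_description: str) -> str:
        # Planning stage: runs once per task; later calls hit the cache.
        if task_description not in self._plans:
            prompt = (
                "Task: " + task_description + "\n"
                "List the visual features of the plotted series that matter "
                "for this task and how to check them."
            )
            self._plans[task_description] = self.llm(prompt, [])
        return self._plans[task_description]

    def solve(self, task_description: str, plot_path: str) -> str:
        # Solving stage: runs per sample, with the series passed as an image.
        prompt = (
            "Task: " + task_description + "\n"
            "Plan: " + self.plan(task_description) + "\n"
            "Follow the plan on the attached plot and give the answer."
        )
        return self.llm(prompt, [plot_path])
```

With any callable that matches `LLMFn`, calling `solve` on N samples of the same task issues one planning call plus N solving calls, which is the cost profile the plan-then-solve split is meant to give.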

Paper Structure

This paper contains 85 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview of our proposed testbed, TimerBed, for evaluating LLMs reasoning about time series. TimerBed defines three patterns of TsR tasks with increasing difficulty: simple deterministic reasoning, complex deterministic reasoning, and probabilistic reasoning. For each reasoning pattern, TimerBed matches two real-world tasks and presents an example in the figure. TimerBed covers four types of LLMs with the corresponding most advanced models and three reasoning strategies for comprehensive evaluation. TimerBed adopts eight supervised time-series models and random guessing as anchors to quantify the success of LLMs for TsR.
  • Figure 2: Normalized Results of Zero-Shot Time-series Reasoning. The accuracy is normalized by random guessing. Detailed original results are in Table \ref{detailofresult1}. LLMs consistently show near-random performance with ZST.
  • Figure 3: Normalized Results of Chain-of-Thought and Few-shot In-Context-Learning Time-series Reasoning. Each subfigure corresponds to one LLM. The accuracy is normalized by random guessing. Detailed original results are in Table \ref{detailofresult1}. CoT shows marginal improvement, while ICL leads to performance degradation.
  • Figure 4: Comparison of the existing numerical modeling solution, denoted as "Traditional Solution", and the proposed VL-Time. The key difference is that VL-Time replaces numerical modeling with visualization modeling for time-series data, which enhances feature extraction and reduces context length. VL-Time further divides the entire reasoning process into planning and solving stages, mimicking the behavior of human experts. A full example is provided in Section \ref{sec:full}. For each task, the planning stage needs to be executed only once.
  • Figure 5: Ablation Study of VL-Time. The planning stage and visualization designs, including the textual legend and timestamps, are all validated as effective.
  • ...and 8 more figures
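
The captions for Figures 2 and 3 report accuracy "normalized by random guessing" without giving a formula here. One common reading, sketched below, is the ratio of a model's accuracy to the accuracy of uniform guessing over the classes, so that 1.0 means chance level. Both the ratio form and the uniform-chance baseline are assumptions, and the helper name `normalize_by_random_guess` is hypothetical; the paper may define the normalization differently.

```python
def normalize_by_random_guess(accuracy: float, num_classes: int) -> float:
    """Return accuracy as a multiple of the uniform random-guess accuracy.

    Assumes chance accuracy is 1 / num_classes; this is an illustrative
    assumption, not a definition taken from the paper.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    chance = 1.0 / num_classes
    return accuracy / chance
```

Under this reading, a binary task solved at 50% accuracy scores 1.0 (no better than guessing), while a four-class task solved at 75% accuracy scores 3.0.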