
Are Language Models Actually Useful for Time Series Forecasting?

Mingtian Tan, Mike A. Merrill, Vinayak Gupta, Tim Althoff, Thomas Hartvigsen

TL;DR

The paper investigates whether pretrained Large Language Models (LLMs) truly improve time-series forecasting. Through extensive ablations of three representative LLM-based forecasters and a comparison with simple, LLM-free encoders, the authors show that LLMs rarely outperform lightweight alternatives while consuming orders of magnitude more compute, challenging the assumption that LLMs' sequential reasoning transfers to time-series data. They further reveal that pretraining provides limited benefits, few-shot gains are minimal, and a simple patching+attention encoding can match or exceed performance. The findings urge a shift away from default LLM adoption in forecasting tasks toward efficient encoders and more promising multimodal avenues that leverage language at the interface, such as time-series reasoning.

Abstract

Large language models (LLMs) are being applied to time series forecasting. But are language models actually useful for time series? In a series of ablation studies on three recent and popular LLM-based time series forecasting methods, we find that removing the LLM component or replacing it with a basic attention layer does not degrade forecasting performance -- in most cases, the results even improve! We also find that despite their significant computational cost, pretrained LLMs do no better than models trained from scratch, do not represent the sequential dependencies in time series, and do not assist in few-shot settings. Additionally, we explore time series encoders and find that patching and attention structures perform similarly to LLM-based forecasters.

Paper Structure

This paper contains 25 sections, 8 figures, and 24 tables.

Figures (8)

  • Figure 1: Overview of all LLM ablation methods. Figure (a) represents time series forecasting using an LLM as the base model. In some works, the LLM components are frozen (jin2023time, gruver2024large), while in others, they undergo fine-tuning (zhou2024one, liu2024taming, cao2023tempo). Figure (b) shows the model with the LLM components removed, retaining only the remaining structure. Figure (c) replaces the LLM components with a single-layer self-attention mechanism. Figure (d) replaces the LLM components with a simple Transformer.
  • Figure 2: In the above examples, only OneFitsAll "w/ LLM" performs better than the ablation methods on ETTh1, but there is substantial overlap in bootstrapped confidence intervals. The figures show the comparison of OneFitsAll, CALF, and Time-LLM using LLMs and ablations (i.e., w/o LLM, LLM2Attn, and LLM2Trsf) on ETTh1, ETTm2, and Electricity, and the vertical dashed lines represent the results from the original work. Further figures for MSE and other datasets are available in Figures fig:mae_ci_left_data and fig:mse_ci in the Appendix.
  • Figure 3: Ablation methods consume less time for inference while providing better forecasting performance. The figure above shows the inference time and prediction accuracy of Time-LLM, OneFitsAll, and CALF on the ETTm2, Traffic, and Electricity datasets, averaged across prediction lengths. For more datasets and MSE metrics, refer to Figures fig:compute_cost_mae_left_data and fig:compute_cost_mse in the Appendix.
  • Figure 4: PAttn Model.
  • Figure 5: Ablation studies indicate that when different methods remove the LLM ("w/o LLM") or replace it with a single-layer attention ("LLM2Attn") or Transformer ("LLM2Trsf"), performance on time series forecasting tasks under the MAE metric does not decline and even improves compared with the original methods, such as "GPT-2" or "LLaMA". The vertical dashed line in the figures represents the results from the original paper. The figures above are from the 'ETTh2', 'ETTm1', 'Illness', 'Weather', and 'Traffic' datasets.
  • ...and 3 more figures
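To make the "patching + attention" idea concrete, the sketch below shows the general shape of such an encoder: the input series is split into fixed-length patches, each patch is linearly embedded as a token, a single self-attention layer mixes the tokens, and a linear head maps the result to a forecast. This is a minimal illustration of the technique, not the paper's actual PAttn implementation; the function name `pattn_forecast` and all hyperparameters are hypothetical, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pattn_forecast(series, patch_len=16, d_model=32, horizon=96, rng=None):
    """Sketch of a patching + single-layer-attention forecaster.

    All weight matrices are random placeholders; in practice they
    would be learned by minimizing forecast error.
    """
    rng = np.random.default_rng(0) if rng is None else rng

    # 1) Patching: split the series into non-overlapping patches.
    n = len(series) // patch_len
    patches = series[: n * patch_len].reshape(n, patch_len)   # (n, patch_len)

    # 2) Embed each patch as a token.
    W_embed = rng.normal(scale=0.1, size=(patch_len, d_model))
    tokens = patches @ W_embed                                # (n, d_model)

    # 3) Single-head self-attention over patch tokens.
    Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_model))
                  for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model))                # (n, n)
    z = attn @ v                                              # (n, d_model)

    # 4) Flatten and project to the forecast horizon.
    W_head = rng.normal(scale=0.1, size=(n * d_model, horizon))
    return z.reshape(-1) @ W_head                             # (horizon,)
```

The same skeleton also describes the paper's LLM2Attn ablation: wherever a forecaster would pass the embedded patch tokens through a pretrained LLM backbone, step 3 substitutes one self-attention layer instead.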