An Evaluation of Standard Statistical Models and LLMs on Time Series Forecasting

Rui Cao, Qiao Wang

TL;DR

This work evaluates the viability of LLMTIME, a Large Language Model–based approach, for time series forecasting and contrasts it with traditional ARIMA baselines. By testing across real-world datasets (Darts and Monash), macroeconomic series, and synthetic noisy almost-periodic signals, it shows that LLMTIME generally underperforms ARIMA, especially for series with trends, seasonality, or multiple frequencies, and its accuracy deteriorates as signal magnitude grows. LLMTIME relies on a digit-wise tokenization and percentile-based normalization with an offset, but its zero-shot forecasting capability is limited, indicating a gap between LLM pretraining gains and time-series priors needed for robust forecasting. The results underscore that traditional methods like ARIMA remain strong baselines for diverse time-series data, while highlighting avenues for future research in integrating LLMs with time-series priors and uncertainty-aware forecasting.
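The preprocessing mentioned above (percentile-based normalization with an offset, followed by digit-wise tokenization) can be illustrated with a minimal sketch. This is not the LLMTIME implementation: the function names, the parameters `alpha`, `beta`, and `precision`, and the exact offset formula are assumptions chosen only to make the idea concrete.

```python
import math


def percentile(values, alpha):
    """Linear-interpolation percentile for alpha in [0, 1]."""
    ordered = sorted(values)
    pos = alpha * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def scale_series(values, alpha=0.95, beta=0.3):
    """Shift by an offset, then divide by the alpha-percentile.

    Assumed form: offset = min - beta * (max - min). The source only says
    "percentile-based normalization with an offset", so this is a guess.
    """
    lo, hi = min(values), max(values)
    offset = lo - beta * (hi - lo)
    shifted = [v - offset for v in values]
    scale = percentile(shifted, alpha)
    if scale == 0:  # constant series: avoid division by zero
        scale = 1.0
    return [s / scale for s in shifted], offset, scale


def serialize(scaled, precision=2):
    """Write each value as space-separated digits, decimal point dropped."""
    tokens = []
    for v in scaled:
        n = round(v * 10 ** precision)
        sign = "-" if n < 0 else ""
        tokens.append(sign + " ".join(str(abs(n))))
    return " , ".join(tokens)


def deserialize(text, precision=2):
    """Invert serialize(): join the digits and restore the decimal point."""
    values = []
    for chunk in text.split(","):
        digits = chunk.replace(" ", "")
        values.append(int(digits) / 10 ** precision)
    return values


if __name__ == "__main__":
    scaled, offset, scale = scale_series([112.0, 118.0, 132.0, 129.0, 121.0])
    prompt = serialize(scaled)
    print(prompt)               # digit string that would be sent to the LLM
    print(deserialize(prompt))  # scaled values recovered from the string
```

Splitting numbers into single digits makes every value use the same small vocabulary, while the scaling keeps the digit strings short. In this sketch a forecast would be mapped back to the original units by multiplying by `scale` and adding `offset`.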

Abstract

This research examines the use of Large Language Models (LLMs) in predicting time series, with a specific focus on the LLMTIME model. Despite the established effectiveness of LLMs in tasks such as text generation, language translation, and sentiment analysis, this study highlights the key challenges that large language models encounter in the context of time series prediction. We assess the performance of LLMTIME across multiple datasets and introduce classical almost periodic functions as time series to gauge its effectiveness. The empirical results indicate that while large language models can perform well in zero-shot forecasting for certain datasets, their predictive accuracy diminishes notably when confronted with diverse time series data and traditional signals. The primary finding of this study is that the predictive capacity of LLMTIME, similar to other LLMs, significantly deteriorates when dealing with time series data that contain both periodic and trend components, as well as when the signal comprises complex frequency components.

Paper Structure (18 sections, 3 equations, 13 figures, 1 table)


Figures (13)

  • Figure 1: Panel (a) shows the LLMTIME forecast of the AirPassengersDataset. As the series values rise gradually over time, LLMTIME predicts the final series value accurately, but its forecast then drops sharply, giving significantly poorer performance than ARIMA. This discrepancy is particularly noticeable in the FredMd dataset from the Monash archive. Panel (b) shows the ARIMA forecast of the AirPassengersDataset. ARIMA reproduces both the waveform and the periodicity well and significantly outperforms LLMTIME in overall Mean Squared Error (MSE). Further experimental results are given in Appendix A.
  • Figure 2: The forecasting performance of LLMTIME declines significantly when predicting the final data point.
  • Figure 3: We forecast the total monetary value of the UK's exports from January 1989 to December 2023, recorded monthly in millions of dollars (see Appendix C for more information).
  • Figure 4: Forecasts by the LLMTIME (a) and ARIMA (b) models for the synthetic signal $f(t) = \cos(2\pi t) + \cos(2t) + \mathrm{noise}$, where $\mathrm{noise}$ is Gaussian noise with mean 0 and standard deviation 0.1. We sample 500 data points from $f(t)$ on the interval $[0, 8\pi]$ to form the series and forecast the final 100 points from the first 400.
  • Figure 5: The x-axis shows the standard deviation of the Gaussian noise and the y-axis shows the MSE. The black and blue bars correspond to LLMTIME and ARIMA, respectively. In all four scenarios, the MSE of LLMTIME is significantly greater than that of ARIMA.
  • ...and 8 more figures
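
The synthetic setup described in the Figure 4 caption (500 samples of $f(t)$ on $[0, 8\pi]$, the first 400 used as history and the last 100 as the forecast target, scored by MSE) can be sketched as follows. The function names, the evenly spaced sampling grid with both endpoints included, and the seed are assumptions. The last-value forecast is only a placeholder to show how MSE is computed; it is neither LLMTIME nor ARIMA.

```python
import math
import random


def make_signal(n_points=500, t_max=8 * math.pi, noise_std=0.1, seed=0):
    """Sample f(t) = cos(2*pi*t) + cos(2*t) + Gaussian noise on [0, t_max].

    Evenly spaced samples including both endpoints are an assumption; the
    caption only says 500 points are chosen from the range.
    """
    rng = random.Random(seed)
    step = t_max / (n_points - 1)
    series = []
    for i in range(n_points):
        t = i * step
        clean = math.cos(2 * math.pi * t) + math.cos(2 * t)
        series.append(clean + rng.gauss(0.0, noise_std))
    return series


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    series = make_signal()
    history, target = series[:400], series[400:]

    # Placeholder forecast: repeat the last observed value 100 times.
    naive_forecast = [history[-1]] * len(target)
    print(f"MSE of last-value baseline: {mse(target, naive_forecast):.4f}")
```

Because the two cosine terms have incommensurate periods (1 and $\pi$), their sum is almost periodic rather than periodic, which is why the signal contains multiple frequencies. Re-running `make_signal` with different `noise_std` values gives a noise sweep of the kind summarized in Figure 5.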