Using Pre-trained LLMs for Multivariate Time Series Forecasting

Malcolm L. Wolff, Shenghao Yang, Kari Torkkola, Michael W. Mahoney

TL;DR

The paper investigates whether pre-trained Large Language Models (LLMs) can be repurposed for multivariate, multi-horizon time-series forecasting by learning lightweight embeddings that map time-series inputs into the LLM's token embedding space while most of the model stays frozen. It introduces a multivariate patching strategy, combined with targeted layer-norm fine-tuning, as a practical way to use pre-trained LLMs (GPT-2, Flan-T5, and MPT-7B) for forecasting, and it uses weight diagnostics from Heavy-Tailed Self-Regularization (HTSR) theory to analyze embedding quality and generalization. Empirical results on retailer-demand data show that LLM-based approaches can approach or surpass state-of-the-art baselines such as MQCNN, with MPT-7B and Flan-T5 variants performing best under certain configurations. The work points to foundation-model transfer as a promising direction for time-series tasks, while acknowledging limitations and proposing future work on broader scope and diagnostics. The HTSR analysis links spectral properties of layer weights to predictive performance and is used to support the empirical findings.

Abstract

Pre-trained Large Language Models (LLMs) encapsulate large amounts of knowledge and take enormous amounts of compute to train. We make use of this resource, together with the observation that LLMs are able to transfer knowledge and performance from one domain or even modality to another seemingly-unrelated area, to help with multivariate demand time series forecasting. Attention in transformer-based methods requires something worth attending to -- more than just samples of a time-series. We explore different methods to map multivariate input time series into the LLM token embedding space. In particular, our novel multivariate patching strategy to embed time series features into decoder-only pre-trained Transformers produces results competitive with state-of-the-art time series forecasting models. We also use recently-developed weight-based diagnostics to validate our findings.
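
The idea of mapping a multivariate series into the token embedding space can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the patch length, the stride, or the form of the embedding, so the helper names `multivariate_patches` and `embed_patches`, the 104-by-5 input, `patch_len=8`, and the random linear projection are all assumptions made for illustration. The only fixed quantity is the target width of 768, which is the hidden size of the smallest GPT-2 model.

```python
import numpy as np


def multivariate_patches(series, patch_len, stride):
    """Split a (time, features) array into flattened patches.

    Each patch covers `patch_len` consecutive time steps across ALL features,
    so one output row holds patch_len * n_features values.
    """
    n_steps, n_features = series.shape
    if n_steps < patch_len:
        raise ValueError("series is shorter than one patch")
    starts = range(0, n_steps - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len].reshape(-1) for s in starts])


def embed_patches(patches, weight, bias):
    """Hypothetical linear embedding from flattened patches to the LLM width."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
series = rng.normal(size=(104, 5))      # made-up sizes: 104 time steps, 5 features
patches = multivariate_patches(series, patch_len=8, stride=8)

d_model = 768                           # hidden width of GPT-2 (small)
weight = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
bias = np.zeros(d_model)

# One row per patch; each row plays the role of one "token" embedding
# that a frozen transformer could consume in place of a word embedding.
tokens = embed_patches(patches, weight, bias)
print(patches.shape, tokens.shape)
```

In a trainable setup the projection weights would be learned while the transformer blocks stay mostly frozen; here they are random because the sketch only shows the shapes involved.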

Paper Structure (27 sections, 4 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: High-level architectural design of our experiments. The static + historic series decoder block MLP (MQCNN) is in fact a complex collection of MLPs for different forecasting horizons and quantiles. The pre-trained LLMs considered are GPT-2, Flan-T5, and MPT-7B.
  • Figure 2: Layer-level weight analysis identifies sub-optimal model architecture and predicts forecasting accuracy. Complementary cumulative distribution function (CCDF) of weight matrix spectrum for the first embedding layer (left) and the output layer (middle). The rightmost plot shows evolution of P50 quantile test loss across epochs for both architectures, where markers and lines are colored according to the fitted $\alpha$.
  • Figure 3: Empirical spectral density (ESD) of the embedding layer (top) and the output layer (bottom), when the base FPT is GPT-2 (left) and Flan-T5 (middle). The input and output layers are linear. Red vertical lines mark the $\lambda_{\min}$ parameter of the power-law (PL) distribution in \ref{eq:pw}, chosen by minimizing the Kolmogorov-Smirnov (KS) distance, and red dashed lines show the PL distribution for the fitted $\alpha$, as in previous work \cite{martin2021predicting}. Additionally, the tail of the complementary cumulative distribution function (CCDF) is shown for each layer (right). For the embedding layer, the CCDFs for both FPTs tend to follow a PL tail. However, the CCDF for Flan-T5 has a much steeper slope (i.e., a better $\alpha$ metric) than the CCDF for GPT-2. We investigate how HTSR metrics (e.g., $\alpha$) predict model quality and give details in Figure \ref{fig:gpt2_vs_flant5_test_loss_and_htsr_metric}. For both architectures, the CCDFs for the output layer have a convex kink around $10^4$ and, overall, do not show strong evidence of (T)PL tails. In Figure \ref{fig:linear_vs_mlp_esd}, we show that replacing the linear output layer with an MLP removes the kink and results in an ESD with a (T)PL tail.
  • Figure 4: HTSR metrics predict forecasting accuracy across architectures (varying base FPTs) and within an architecture across epochs. P50 quantile test loss by epoch for GPT-2- and Flan-T5-based architectures. Markers and lines are colored according to the $\alpha$ metric (left) and the stable rank metric (right). Both within one architecture across epochs and across architectures at a fixed epoch, a higher metric value generally corresponds to higher accuracy.
  • Figure 5: ESD of the first embedding layer (top), the second embedding layer (middle; this layer exists only in the MLP variant, so there is no plot on the left), and the output layer (bottom), when the embedding/decoding layer is linear (left) and a 2-layer MLP (middle). Red vertical lines mark the $\lambda_{\min}$ parameter of the PL distribution in \ref{eq:pw}, chosen by minimizing the KS distance, and red dashed lines show the PL distribution for the fitted $\alpha$. Additionally, the tail of the complementary cumulative distribution function (CCDF) is shown for each layer (right). The CCDFs of the architecture that uses MLP embedding/decoding not only show (T)PL tails but also have steeper slopes (i.e., a better $\alpha$ metric) than the CCDFs of the architecture that uses linear embedding/decoding. The close relationship between HTSR metrics (e.g., the $\alpha$ metric) and model quality is illustrated in Figure \ref{fig:linear_vs_mlp_test_loss_and_htsr_metric}.
  • ...and 4 more figures
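
The captions above rely on three quantities: the spectrum of a layer's weight matrix, a power-law exponent $\alpha$ fitted to the tail of that spectrum with $\lambda_{\min}$ chosen by minimizing the KS distance, and the stable rank. The sketch below shows one common way to compute them and is not taken from the paper. It assumes the ESD is the set of eigenvalues of $W^\top W$ without further normalization, uses the standard continuous power-law maximum-likelihood estimate for $\alpha$, and scans every candidate $\lambda_{\min}$; the function names and the `min_tail` cutoff are hypothetical.

```python
import numpy as np


def esd(weight):
    """Eigenvalues of W^T W (squared singular values of W), sorted ascending."""
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return np.sort(singular_values ** 2)


def fit_power_law(eigs, min_tail=10):
    """Return (alpha, lambda_min, ks) for the tail fit with the smallest KS distance.

    Every eigenvalue is tried as lambda_min, keeping at least `min_tail` points
    in the tail (an assumed cutoff to avoid fitting a handful of values).
    """
    eigs = np.sort(eigs[eigs > 0])
    best = None
    for i in range(len(eigs) - min_tail + 1):
        lam_min = eigs[i]
        tail = eigs[i:]
        n = len(tail)
        log_sum = np.sum(np.log(tail / lam_min))
        if log_sum <= 0:
            continue
        alpha = 1.0 + n / log_sum                      # continuous power-law MLE
        model_cdf = 1.0 - (tail / lam_min) ** (1.0 - alpha)
        upper = np.arange(1, n + 1) / n                # empirical CDF just after each point
        lower = np.arange(0, n) / n                    # empirical CDF just before each point
        ks = max(np.max(np.abs(upper - model_cdf)),
                 np.max(np.abs(lower - model_cdf)))
        if best is None or ks < best[2]:
            best = (alpha, lam_min, ks)
    return best


def stable_rank(weight):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(weight, compute_uv=False)
    return np.sum(s ** 2) / s[0] ** 2


rng = np.random.default_rng(0)
# A random Gaussian matrix stands in for a trained layer. Its spectrum is not
# heavy-tailed, so the numbers only demonstrate the calls, not a realistic fit.
weight = rng.standard_normal((256, 64))
alpha, lam_min, ks = fit_power_law(esd(weight))
print(f"alpha={alpha:.2f} lambda_min={lam_min:.1f} ks={ks:.3f} "
      f"stable_rank={stable_rank(weight):.1f}")
```

Scanning every candidate $\lambda_{\min}$ costs quadratic time in the number of eigenvalues, which is acceptable for a single layer but may call for a coarser grid on very wide matrices.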