From Text to Time? Rethinking the Effectiveness of the Large Language Model for Time Series Forecasting

Xinyu Zhang, Shanshan Feng, Xutao Li

TL;DR

This paper questions the efficacy of pre-trained LLM backbones for time series forecasting by showing that, on small datasets, the Encoder and Decoder overfit and mask the backbone's capabilities. It proposes three large-scale pre-training strategies that decouple the Encoder and Decoder from the backbone, enabling zero-shot and few-shot evaluations that better reveal the backbone's true potential. Across seven real-world datasets, LLM backbones show only limited advantages: performance is often dominated by the Encoder and Decoder, and a backbone generally needs substantial time-series pre-training (tens of millions of samples) to approach GPT-2-like performance. The work implies that, for now, transformers trained on time-series data can be more effective on small to medium datasets, and it highlights the need for large-scale pre-training and architecture tuning specific to time-series tasks.

Abstract

Using pre-trained large language models (LLMs) as the backbone for time series prediction has recently gained significant research interest. However, the effectiveness of LLM backbones in this domain remains a topic of debate. Based on thorough empirical analyses, we observe that training and testing LLM-based models on small datasets often leads to the Encoder and Decoder becoming overly adapted to the dataset, thereby obscuring the true predictive capabilities of the LLM backbone. To investigate the genuine potential of LLMs in time series prediction, we introduce three pre-training models with identical architectures but different pre-training strategies. Large-scale pre-training thereby allows us to create unbiased Encoder and Decoder components tailored to the LLM backbone. Through controlled experiments, we evaluate the zero-shot and few-shot prediction performance of the LLM, offering insights into its capabilities. Extensive experiments reveal that although the LLM backbone demonstrates some promise, its forecasting performance is limited. Our source code is publicly available in the anonymous repository: https://anonymous.4open.science/r/LLM4TS-0B5C.

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Performance comparison on small datasets across different pre-training knowledge. Prediction performance of the TP-GPT, DI-GPT, and TS-GPT models on various small public datasets. Despite differing pre-training methods, the models perform similarly, indicating that on limited data the Encoder and Decoder can adapt effectively without relying on additional pre-trained parameters. This suggests that the frozen Backbone's potential remains unexplored in such settings.
  • Figure 2: Comparison of relative errors for different models against PatchTST on varying dataset sizes. LLM-based models exhibit lower relative errors on small datasets but show higher relative errors as dataset size increases. In contrast, TSMixer and ModernTCN maintain consistent performance across all dataset sizes.
  • Figure 3: LLM-based Time-series Model Architecture
  • Figure 4: Framework for Pre-training Strategies
  • Figure 5: (a) Illustrates the distribution of GPT-2's vocabulary tokens alongside token representations from various time series datasets, showing a clear separation between the two distributions. (b) Demonstrates that tokens from the ETTh1 dataset are mapped from outside to inside the GPT-2 vocabulary range using a cross-attention mechanism.
  • ...and 1 more figure
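
The relative-error comparison against PatchTST described for Figure 2 can be illustrated with a small sketch. The exact definition used in the paper is not given in this summary, so the code below assumes the signed form (model error minus baseline error, divided by baseline error). The function name `relative_error` and all MSE values are hypothetical and chosen only for illustration.

```python
def relative_error(model_mse: float, baseline_mse: float) -> float:
    """Signed relative error of a model against a baseline.

    Assumed definition: (model - baseline) / baseline.
    Negative values mean the model beats the baseline.
    """
    if baseline_mse <= 0:
        raise ValueError("baseline_mse must be positive")
    return (model_mse - baseline_mse) / baseline_mse


# Hypothetical MSE values, for illustration only (not taken from the paper).
baseline = {"small": 0.40, "large": 0.20}   # stands in for the PatchTST baseline
candidate = {"small": 0.38, "large": 0.23}  # stands in for an LLM-based model

rel = {name: relative_error(candidate[name], baseline[name]) for name in baseline}

for name, value in rel.items():
    # Print as a signed percentage, e.g. a negative sign when the model is better.
    print(f"{name}: {value:+.1%}")
```

With these made-up numbers the candidate is slightly better than the baseline on the small dataset and worse on the large one, which is the qualitative pattern the Figure 2 caption reports for LLM-based models.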