One Fits All: Power General Time Series Analysis by Pretrained LM

Tian Zhou, PeiSong Niu, Xue Wang, Liang Sun, Rong Jin

TL;DR

The paper proposes Frozen Pretrained Transformer (FPT) to unify time series analysis by transferring frozen self-attention from NLP/CV backbones to a wide range of tasks. It employs patch-based tokens and trains only input embeddings, normalization, and output heads, enabling cross-domain knowledge transfer while keeping the core transformer fixed. Across seven core tasks—imputation, classification, anomaly detection, long-/short-term forecasting, and few-shot/zero-shot forecasting—the GPT-2–based FPT achieves state-of-the-art or competitive performance and demonstrates universality across backbones like BERT and BEiT. The authors further elucidate the connection between self-attention and PCA to explain universality, discuss practical training/inference costs, and outline future directions such as parameter-efficient fine-tuning and n-gram analyses to deepen understanding of transformer generality.
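
The "patch-based tokens" mentioned above can be illustrated with a minimal sketch. It only shows how a univariate series might be cut into overlapping patches before a trainable input embedding maps each patch to a token. The function name `make_patches` and the parameters `patch_len` and `stride` are hypothetical and chosen for illustration; they are not taken from the authors' code.

```python
from typing import List, Sequence


def make_patches(series: Sequence[float], patch_len: int, stride: int) -> List[List[float]]:
    """Split a 1-D series into (possibly overlapping) patches of length patch_len.

    Illustrative helper only: trailing points that do not fill a whole
    patch are dropped, which is an assumption of this sketch.
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    if len(series) < patch_len:
        raise ValueError("series is shorter than one patch")
    # Start indices of every patch that fits completely inside the series.
    starts = range(0, len(series) - patch_len + 1, stride)
    return [list(series[s:s + patch_len]) for s in starts]


series = [float(t) for t in range(10)]
patches = make_patches(series, patch_len=4, stride=2)
print(len(patches))             # number of tokens fed to the transformer
print(patches[0], patches[-1])  # first and last patch
```

With a series of length 10, a patch length of 4, and a stride of 2, the sketch yields four patches, so the transformer would see four tokens instead of ten individual time steps.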

Abstract

Although we have witnessed great success of pre-trained models in natural language processing (NLP) and computer vision (CV), limited progress has been made for general time series analysis. Unlike NLP and CV, where a unified model can be used to perform different tasks, specially designed approaches still dominate each time series analysis task, such as classification, anomaly detection, forecasting, and few-shot learning. The main challenge that blocks the development of pre-trained models for time series analysis is the lack of a large amount of data for training. In this work, we address this challenge by leveraging language or CV models, pre-trained from billions of tokens, for time series analysis. Specifically, we refrain from altering the self-attention and feedforward layers of the residual blocks in the pre-trained language or image model. This model, known as the Frozen Pretrained Transformer (FPT), is evaluated through fine-tuning on all major types of tasks involving time series. Our results demonstrate that models pre-trained on natural language or images can lead to comparable or state-of-the-art performance in all main time series analysis tasks, as illustrated in Figure 1. We also found, both theoretically and empirically, that the self-attention module behaves similarly to principal component analysis (PCA), an observation that helps explain how the transformer bridges the domain gap and is a crucial step towards understanding the universality of a pre-trained transformer. The code is publicly available at https://github.com/DAMO-DI-ML/One_Fits_All.
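
The freezing scheme described in the abstract (self-attention and feedforward layers untouched, the remaining layers fine-tuned) can be sketched as a simple selection over parameter names. This is a minimal sketch under stated assumptions, not the authors' implementation: the parameter names imitate GPT-2 style naming, `in_embed` and `out_head` are hypothetical names for the task-specific input and output layers, and treating the positional embedding `wpe` as trainable is an assumption of this example.

```python
from typing import Dict, Iterable, Tuple

# Substrings that mark a parameter as trainable in this sketch (assumed naming):
# "ln_" for layer normalization, "wpe" for positional embeddings, and two
# hypothetical task-specific layers, "in_embed" and "out_head".
TRAINABLE_MARKERS: Tuple[str, ...] = ("ln_", "wpe", "in_embed", "out_head")


def trainable_flags(param_names: Iterable[str],
                    markers: Tuple[str, ...] = TRAINABLE_MARKERS) -> Dict[str, bool]:
    """Map each parameter name to True (train) or False (keep frozen)."""
    return {name: any(m in name for m in markers) for name in param_names}


# Illustrative parameter names for one transformer block plus the outer layers.
names = [
    "in_embed.weight",          # hypothetical patch embedding (trainable)
    "wpe.weight",               # positional embedding (trainable, assumed)
    "h.0.ln_1.weight",          # layer normalization (trainable)
    "h.0.attn.c_attn.weight",   # self-attention (frozen)
    "h.0.mlp.c_fc.weight",      # feedforward (frozen)
    "ln_f.weight",              # final layer normalization (trainable)
    "out_head.weight",          # hypothetical output layer (trainable)
]
flags = trainable_flags(names)
for name, is_trainable in flags.items():
    print(f"{name:28s} {'train' if is_trainable else 'frozen'}")
```

In a PyTorch setting, flags like these would typically be applied by setting `requires_grad` on each parameter while iterating over `named_parameters()`; the exact selection rule used in the released code may differ from this sketch.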

Paper Structure (49 sections, 8 theorems, 4 equations, 9 figures, 29 tables)

Key Result

Lemma 8.1

Let the Jacobian $J = \left[\frac{\partial f_i(X)}{\partial x_j}\right]_{i,j=1}^N$ represent the gradient of $f(X)$ with respect to the input pattern; then we have $$ where $$ and $$.

Figures (9)

  • Figure 2: Model architecture. Pre-trained parameters are transferred to the time series forecasting tasks. Self-attention and Feedforward layers in the transformer blocks are frozen while only the embedding layer, normalization layers, and output layer require training.
  • Figure 3: Model comparison in classification. The results are averaged from 10 subsets of UEA. Appendix \ref{appendix:classification_full} shows the full results.
  • Figure 4: (a, c) The performance and token similarity within samples with respect to each layer under different random mixing ratios. Pre-trained parameters are mixed with randomly initialized parameters in given proportions. (b) Token similarity within samples when replacing the attention with PCA.
  • Figure 5: Visualization of imputation, long-term forecasting and few-shot forecasting.
  • Figure 6: The performance and token similarity within samples with respect to each layer under different random replacement ratios. Pre-trained parameters are replaced by randomly initialized parameters in given proportions.
  • ...and 4 more figures
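
The captions of Figures 4 and 6 refer to "token similarity within samples". One plausible reading, assumed here rather than confirmed by this summary, is the mean pairwise cosine similarity between the token vectors of a single sample at a given layer. The sketch below computes that quantity; the function names `cosine` and `mean_token_similarity` are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_token_similarity(tokens: List[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of tokens in one sample."""
    pairs = list(combinations(tokens, 2))
    if not pairs:
        raise ValueError("need at least two tokens")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy sample with three 2-D tokens: two identical, one orthogonal to them.
sample = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
similarity = mean_token_similarity(sample)
print(round(similarity, 4))
```

Under this reading, a value close to 1 means the tokens of a sample have collapsed onto nearly the same direction, and lower values mean the tokens remain distinguishable.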

Theorems & Definitions (12)

  • Lemma 8.1
  • Theorem 1
  • Theorem E.1: informal
  • Theorem E.2: formal statement of Theorem \ref{thm:1}
  • proof
  • Theorem E.3: formal statement of Theorem \ref{thm:2}
  • proof
  • Theorem E.4: informal
  • Lemma G.1
  • proof
  • ...and 2 more