TimesBERT: A BERT-Style Foundation Model for Time Series Understanding

Haoran Zhang, Yong Liu, Yunzhong Qiu, Haixuan Liu, Zhongyi Pei, Jianmin Wang, Mingsheng Long

TL;DR

TimesBERT introduces a BERT-style encoder for time series understanding, treating multivariate time series as multisentence documents and repurposing functional tokens for multi-granularity reasoning. It pre-trains on a corpus of $260$ billion time points with two objectives, Masked Patch Modeling (MPM) and Functional Token Prediction (FTP), using a unified time-series embedding and an encoder with $L=12$, $H=768$, $A=12$ (in BERT's notation: layers, hidden size, and attention heads) and a $512$-token context. The model achieves state-of-the-art results on four understanding tasks (classification, imputation, anomaly detection, and short-term forecasting) evaluated over hundreds of real-world datasets, demonstrating strong transferability and cross-domain robustness. Ablation studies show the value of the FTP task and of multivariate modeling, and the results favor time-series-native pre-training over cross-modal initialization. This work positions TimesBERT as a versatile foundation model for time series understanding with practical implications for cross-domain analytics.

Abstract

Time series analysis is crucial in diverse scenarios. Beyond forecasting, many real-world tasks fall into classification, imputation, and anomaly detection, which call for distinct capabilities that this paper terms time series understanding. While GPT-style models have been positioned as foundation models for time series forecasting, the BERT-style architecture, which has made significant advances in natural language understanding, has not been fully unlocked for time series understanding, possibly because essential elements of BERT were dropped along the way. In this paper, inspired by the shared multi-granularity structure between multivariate time series and multisentence documents, we design TimesBERT to learn generic representations of time series including temporal patterns and variate-centric characteristics. In addition to a natural adaptation of masked modeling, we propose a parallel task of functional token prediction to embody vital multi-granularity structures. Our model is pre-trained on 260 billion time points across diverse domains. Leveraging multi-granularity representations, TimesBERT achieves state-of-the-art performance across four typical downstream understanding tasks, outperforming task-specific models and language pre-trained backbones, positioning it as a versatile foundation model for time series understanding.

Paper Structure

This paper contains 41 sections, 7 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: TimesBERT inherits and extends the pre-training and fine-tuning paradigm established by BERT, which learns generalizable representation through pre-training on large-scale datasets of arbitrary multivariate time series, and adapts the foundation model to diverse tasks of time series understanding.
  • Figure 2: A multivariate time series is worth a natural language document. We propose to fully repurpose BERT for learning structured representations of time series. The representations embodying different granularities can facilitate diverse time series understanding tasks.
  • Figure 3: Comparison between GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and TimesBERT on embedding, backbone, and training objective. In contrast to BERT's sentence pair formulation, we implement an embedding approach for data with an arbitrary number of variates and design corresponding functional tokens to accommodate the inherent irregularity of time series variates.
  • Figure 4: Illustration of the TimesBERT architecture and pre-training objectives. The input multivariate time series is embedded into a token sequence for the transformer encoder through a unified time series embedding process that includes patching, functional token insertion, and flattening. From the backbone output, the reconstructed patches and the functional tokens are fed into their respective pre-training tasks, MPM and FTP, which together form the joint optimization objective.
  • Figure 5: Overall Performance of TimesBERT.
  • ...and 8 more figures
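
The embedding process named in the Figure 4 caption (patching, functional token insertion, flattening) can be illustrated with a small sketch. This is not the authors' implementation: the marker names `[SERIES]` and `[VARIATE]`, the placement of the markers, the patch length, and the helper names `patch` and `embed_layout` are all assumptions made for illustration, since the text above only states that functional tokens are inserted. A real model would also map every patch and marker to an $H$-dimensional vector, which is omitted here.

```python
from typing import List, Union

# Hypothetical marker names; the source only says "functional tokens".
SERIES_TOKEN = "[SERIES]"    # assumed: one token opening the whole sample
VARIATE_TOKEN = "[VARIATE]"  # assumed: one token closing each variate

Token = Union[str, List[float]]


def patch(series: List[float], patch_len: int) -> List[List[float]]:
    """Split one variate into non-overlapping patches, dropping any remainder."""
    n_patches = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n_patches)]


def embed_layout(variates: List[List[float]], patch_len: int) -> List[Token]:
    """Patch each variate, insert functional tokens, flatten to one sequence."""
    tokens: List[Token] = [SERIES_TOKEN]
    for series in variates:
        tokens.extend(patch(series, patch_len))  # patches of this variate
        tokens.append(VARIATE_TOKEN)             # marker after the variate
    return tokens


if __name__ == "__main__":
    # Toy sample with two variates; values are made up for the demo.
    sample = [
        [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        [10.0, 11.0, 12.0, 13.0],
    ]
    for tok in embed_layout(sample, patch_len=2):
        print(tok)
```

Because the variates are laid out one after another in a single flat sequence, the same encoder can consume samples with any number of variates, in the same way BERT consumes documents with any number of sentences up to its context length.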