
NuTime: Numerically Multi-Scaled Embedding for Large-Scale Time-Series Pretraining

Chenguo Lin, Xumeng Wen, Wei Cao, Congrui Huang, Jiang Bian, Stephen Lin, Zhirong Wu

TL;DR

NuTime tackles scalable, cross-domain time-series representation learning, where numerical scales vary widely. It partitions each input into non-overlapping windows, encodes each window by its normalized shape, mean, and standard deviation, and feeds the resulting tokens into a Transformer. The core contribution is the Numerically Multi-scaled Embedding (NME), which ensembles multiple scale-specific blocks across a spectrum of scales $k_i$ (ranging from $10^{-4}$ to $10^4$) and weights them by $\alpha_i(x)$, enabling robust encoding of scalars with arbitrary amplitudes: $\mathbf{e}(x) = \sum_{i=1}^n \alpha_i(x) \cdot \mathbf{y}_i(x)$ with $\alpha_i(x) = \frac{|\log^{-1}(|x| / k_i + \epsilon)|}{\sum_j |\log^{-1}(|x| / k_j + \epsilon)|}$. The model is pretrained with BYOL on a large, cross-domain dataset (~1.89M sequences) and evaluated on univariate and multivariate classification, few-shot learning, clustering, and anomaly detection, achieving state-of-the-art transfer performance. Overall, NuTime demonstrates that a general-purpose time-series foundation model is feasible, with strong cross-domain transfer and competitive downstream performance, while acknowledging decoding and hyperparameter considerations for future work.
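
The NME weighting above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It rests on several assumptions: $\log^{-1}$ is read as the reciprocal of the natural logarithm, the scales $k_i$ are spaced one decade apart, each block has the form $\mathbf{y}_i(x) = \mathrm{LayerNorm}(\mathbf{w} x + k_i \mathbf{b})$ (following the Figure 3 caption, with no learned affine parameters), and a small `floor` guards against division by zero. The names `nme_weights`, `nme_embed`, and `floor` are hypothetical.

```python
import numpy as np

def nme_weights(x, scales, eps=1e-6, floor=1e-12):
    """alpha_i(x): normalized reciprocal of |log(|x| / k_i + eps)|."""
    log_ratio = np.abs(np.log(np.abs(x) / scales + eps))
    # floor avoids dividing by zero when |x| / k_i + eps == 1 (assumption, not from the paper)
    inv = 1.0 / np.maximum(log_ratio, floor)
    return inv / inv.sum()

def layer_norm(v, eps=1e-5):
    """LayerNorm over the channel axis, without learned scale or shift."""
    return (v - v.mean()) / (v.std() + eps)

def nme_embed(x, scales, w, b):
    """e(x) = sum_i alpha_i(x) * y_i(x), with y_i(x) = LayerNorm(w * x + k_i * b)."""
    ys = np.stack([layer_norm(w * x + k * b) for k in scales])  # shape (n_scales, dim)
    return nme_weights(x, scales) @ ys                          # shape (dim,)

# Assumed decade spacing: 1e-4, 1e-3, ..., 1e4 (nine scales).
scales = 10.0 ** np.arange(-4, 5)
rng = np.random.default_rng(0)
w, b = rng.normal(size=16), rng.normal(size=16)  # stand-ins for learned parameters

alphas = nme_weights(250.0, scales)
embedding = nme_embed(250.0, scales, w, b)
```

Under these assumptions, the block whose scale is closest to $|x|$ on a log axis receives the largest weight, so an input of 250 is dominated by the $k = 10^2$ block.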

Abstract

Recent research on time-series self-supervised models shows great promise in learning semantic representations. However, it has been limited to small-scale datasets, e.g., thousands of temporal sequences. In this work, we make key technical contributions that are tailored to the numerical properties of time-series data and allow the model to scale to large datasets, e.g., millions of temporal sequences. We adopt the Transformer architecture by first partitioning the input into non-overlapping windows. Each window is then characterized by its normalized shape and two scalar values denoting the mean and standard deviation within each window. To embed scalar values that may possess arbitrary numerical amplitudes in a high-dimensional space, we propose a numerically multi-scaled embedding module enumerating all possible numerical scales for the scalars. The model undergoes pretraining with a simple contrastive objective on a large-scale dataset of over a million sequences collected by merging existing public data. We study its transfer performance on a number of univariate and multivariate classification tasks, few-shot learning, unsupervised clustering, and anomaly detection benchmarks. Our method exhibits remarkable improvement over previous pretraining approaches and establishes a new state of the art, even compared with domain-specific non-learning-based methods. Code is available at: https://github.com/chenguolin/NuTime.
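
The windowing step described in the abstract can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the released code: the function name `patchify`, the choice to drop a trailing partial window, and the small `eps` added to the standard deviation are all hypothetical.

```python
import numpy as np

def patchify(series, window, eps=1e-8):
    """Split a 1-D series into non-overlapping windows and return, per window,
    the normalized shape, the mean, and the standard deviation."""
    series = np.asarray(series, dtype=float)
    n = len(series) // window                 # trailing remainder is dropped (assumption)
    windows = series[: n * window].reshape(n, window)
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    shape = (windows - mean) / (std + eps)    # eps avoids division by zero on flat windows
    return shape, mean[:, 0], std[:, 0]

# Three windows of length 4 from a simple ramp.
shape, mean, std = patchify(np.arange(12.0), window=4)
```

On this ramp, all three windows share the same normalized shape and differ only in their means, which is the information the two scalar values are meant to carry separately from the shape.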

Paper Structure (42 sections, 2 equations, 11 figures, 11 tables)

This paper contains 42 sections, 2 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: (a) Numerical scales of three temporal sequences from three datasets differ significantly. (b) Even a single sequence may contain multiple scales of numerical variations. The zoom-in view shows the local structure of small variations. Note that sequences are shifted above the x-axis and presented in a logarithmic scale for better visualizations.
  • Figure 2: Architecture Overview. The proposed model first patchifies the input sequence into non-overlapping windows. Each window is represented by its normalized shape, the window mean and the window std. Embeddings for the three components are concatenated and transformed to be fed as input tokens into a Transformer encoder. Details about the numerically multi-scaled embedding (NME) for scalar values are explained in the section on NME.
  • Figure 3: (a) Output of a Basic Building Block. The input and output response of the basic building block, a linear layer followed by a LayerNorm, with different multipliers $k$ set for the bias term. Only a single output channel is visualized. The function saturates when the input falls outside a scale related to $k$. (b) Numerically Multi-scaled Embedding. To encode an input scalar of arbitrary value, the proposed numerically multi-scaled embedding (NME) module ensembles the outputs of multiple basic building blocks with different multipliers $k$ by a weighted average. (c) Input and Output Response Curves of the NME Module. The two plots show 5 channels of the output $\mathbf{e}(x)$ for a randomly initialized module and a pretrained module, under log scale. The embedding module models a complex function that reflects multiple scales of variation.
  • Figure 4: Test accuracy critical difference diagrams of NuTime's performance versus supervised state-of-the-art methods across (a) 112 datasets from the UCR archive and (b) 26 datasets from the UEA archive. We report the results of NuTime as an ensemble of five runs, each finetuned with different random seeds using the same pretrained model.
  • Figure 5: Accuracy comparison of NuTime and (a) HIVE-COTE2.0 (Middlehurst et al., 2021), (b) MultiRocket (Tan et al., 2022), (c) InceptionTime (Ismail Fawaz et al., 2020) on 112 datasets from the UCR archive. Each subfigure's title displays a win/tie/loss comparison between NuTime and other methods. The two dotted lines indicate the 5% interval.
  • ...and 6 more figures