Bridging Distribution Gaps in Time Series Foundation Model Pretraining with Prototype-Guided Normalization

Peiliang Gong, Emadeldeen Eldele, Min Wu, Zhenghua Chen, Xiaoli Li, Daoqiang Zhang

TL;DR

ProtoN-FM introduces ProtoNorm, a prototype-guided dynamic normalization that selects a LayerNorm module for each sample via a gating network, so that normalization adapts to heterogeneous time series data. Prototypes are updated with an exponential moving average (EMA) and constrained by an orthogonality loss so that they remain diverse distribution anchors, while self-supervised contrastive pretraining aligns representations across datasets. The approach is modular and can drop into existing Transformer architectures, including MOMENT and Moirai, with minimal changes. Empirically, ProtoN-FM yields consistent improvements in both classification and forecasting under in-distribution and out-of-distribution conditions, demonstrating robust generalization across diverse time series domains.
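The two prototype-maintenance steps named above (EMA updates and an orthogonality constraint) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the momentum value of 0.9, and the choice of squared cosine similarity as the penalty are all assumptions made for illustration, since this summary does not give the exact formulas.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def ema_update(prototype, sample_mean, momentum=0.9):
    """Move a prototype toward the mean feature of the samples assigned to it.

    momentum=0.9 is an illustrative default, not a value taken from the paper.
    """
    return [momentum * p + (1.0 - momentum) * s
            for p, s in zip(prototype, sample_mean)]


def orthogonality_penalty(prototypes):
    """Sum of squared cosine similarities over all distinct prototype pairs.

    The penalty is zero when the prototypes are mutually orthogonal and grows
    as they collapse onto each other. This specific form is an assumption.
    """
    unit = [normalize(p) for p in prototypes]
    total = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            dot = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += dot * dot
    return total
```

In a training loop of this shape, `ema_update` would be applied outside of gradient descent, while `orthogonality_penalty` would be added to the pretraining loss with some weight; both choices are assumptions about how such a mechanism is typically wired together.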

Abstract

Foundation models have achieved remarkable success across diverse machine-learning domains through pretraining on large, diverse datasets. However, pretraining on such datasets introduces significant challenges due to substantial mismatches in data distributions, a problem particularly pronounced with time series data. In this paper, we tackle this issue by proposing a domain-aware adaptive normalization strategy within the Transformer architecture. Specifically, we replace the traditional LayerNorm with a prototype-guided dynamic normalization mechanism (ProtoNorm), where learned prototypes encapsulate distinct data distributions, and sample-to-prototype affinity determines the appropriate normalization layer. This mechanism effectively captures the heterogeneity of time series characteristics, aligning pretrained representations with downstream tasks. Through comprehensive empirical evaluation, we demonstrate that our method significantly outperforms conventional pretraining techniques across both classification and forecasting tasks, while effectively mitigating the adverse effects of distribution shifts during pretraining. Incorporating ProtoNorm is as simple as replacing a single line of code. Extensive experiments on diverse real-world time series benchmarks validate the robustness and generalizability of our approach, advancing the development of more versatile time series foundation models.
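To make the idea of affinity-based selection of a normalization layer concrete, the following is a simplified sketch in pure Python. It makes several assumptions that this summary does not confirm: a sample is reduced to a single feature vector, affinity is measured by cosine similarity, and the selection is a hard argmax rather than a learned gating network. The class name `ProtoNormSketch` and its methods are hypothetical.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer normalization over one feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = math.sqrt(var + eps)
    return [g * (v - mean) / scale + b for v, g, b in zip(x, gamma, beta)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


class ProtoNormSketch:
    """Holds K prototypes and K sets of LayerNorm parameters (gamma, beta).

    Each input is normalized with the parameters of its most similar
    prototype. Hard argmax selection is an assumption of this sketch.
    """

    def __init__(self, prototypes, gammas, betas):
        assert len(prototypes) == len(gammas) == len(betas)
        self.prototypes = prototypes
        self.gammas = gammas
        self.betas = betas

    def select(self, x):
        """Index of the prototype with the highest affinity to x."""
        sims = [cosine(x, p) for p in self.prototypes]
        return max(range(len(sims)), key=sims.__getitem__)

    def __call__(self, x):
        k = self.select(x)
        return layer_norm(x, self.gammas[k], self.betas[k])


# Usage: two prototypes, each paired with its own normalization parameters.
norm = ProtoNormSketch(
    prototypes=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    gammas=[[1.0] * 4, [2.0] * 4],
    betas=[[0.0] * 4, [1.0] * 4],
)
out = norm([1.0, 1.0, 1.0, 3.0])  # closest to the second prototype
```

Because the object is called with the same input and returns an output of the same shape as a plain layer normalization, a module of this form could stand in wherever a LayerNorm is constructed, which is one plausible reading of the "single line of code" remark.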

Paper Structure

This paper contains 51 sections, 8 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: (a) Distributional shifts exist among three UCR time series datasets. (b) Fine-tuning performance comparison on these datasets after different pretraining strategies. Individual refers to pretraining and fine-tuning a Transformer model on each dataset separately. Vanilla denotes pretraining the foundation model on multiple datasets without additional design considerations. In ProtoN-FM, we utilize the same multi-dataset pretraining, but incorporate our prototype-guided dynamic normalization mechanism, resulting in superior performance across diverse datasets.
  • Figure 2: Framework comparison between vanilla Transformer and our ProtoN-FM. (a) Vanilla Transformer with standard LayerNorm, employing fixed normalization parameters across all inputs. (b) Our ProtoN-FM with ProtoNorm mechanism for dynamic LayerNorm assignment via prototype-guided gating. Each input is assigned appropriate LayerNorm parameters based on its similarity to learned prototypes, where these prototypes undergo continuous refinement through EMA updates during training.
  • Figure 3: Visualization of learned prototypes and sample features. Prototypes capture the unique distribution patterns of each cluster.
  • Figure 4: Comparative analysis of model performance across classification and forecasting tasks. Full results are listed in Tables \ref{tab:moment_ucr_comparison} and \ref{tab:moirai_iid_mae_comparison} in the Appendix. (a) Classification accuracy evaluation across 91 UCR datasets without fine-tuning. (b) Quantitative assessment via normalized MAE metrics and frequency of optimal performance on the Monash benchmark.
  • Figure 5: Scaling efficiency analysis. (a) Average classification accuracy of ProtoN-FM across 91 UCR datasets with varying numbers of prototypes. Full results are listed in Table \ref{tab:different_prototypes_ucr_results} in the Appendix. (b) Impact of pretraining dataset scale on classification accuracy, evaluated across two UCR datasets.
  • ...and 3 more figures