BLAST: Balanced Sampling Time Series Corpus for Universal Forecasting Models

Zezhi Shao, Yujie Li, Fei Wang, Chengqing Yu, Yisong Fu, Tangwen Qian, Bin Xu, Boyu Diao, Yongjun Xu, Xueqi Cheng

TL;DR

This work tackles the data diversity gap in universal time-series forecasting by introducing BLAST, a pre-training corpus built from $3.21\times 10^{11}$ observations and shaped by a balanced sampling framework. BLAST characterizes each series with seven statistical metrics, discretizes them into a 61-dimensional representation, reduces dimensionality with UMAP, and applies grid sampling and grid mixup to ensure broad pattern coverage. Empirical results show BLAST-based pre-training yields state-of-the-art forecasting performance while reducing training tokens and hardware requirements, outperforming naive or stratified sampling regimes. The approach demonstrates that data diversity, explicitly engineered through grid-based sampling, markedly improves training efficiency and generalization for universal forecasting models, with practical implications for scalable time-series learning.
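
The grid sampling step can be illustrated with a minimal sketch. It assumes the series have already been reduced to a 2-D embedding (for example by UMAP) and that sampling picks a non-empty grid cell uniformly at random, then picks one series uniformly within that cell. The function name `grid_sample`, the parameter `n_bins`, and the synthetic embedding are hypothetical and are not taken from the paper.

```python
import numpy as np

def grid_sample(embedding, n_samples, n_bins=10, seed=0):
    """Pick a non-empty grid cell uniformly, then a point uniformly inside it."""
    rng = np.random.default_rng(seed)
    embedding = np.asarray(embedding, dtype=float)
    lo = embedding.min(axis=0)
    hi = embedding.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    # Map each coordinate to an integer bin in [0, n_bins - 1].
    bins = np.minimum(((embedding - lo) / span * n_bins).astype(int), n_bins - 1)
    # Flatten the 2-D bin coordinates into one cell id per point.
    cell_ids = bins[:, 0] * n_bins + bins[:, 1]
    cells = {}
    for idx, cid in enumerate(cell_ids):
        cells.setdefault(int(cid), []).append(idx)
    cell_keys = sorted(cells)
    # Every non-empty cell is equally likely, regardless of how many points it holds.
    chosen_cells = rng.choice(cell_keys, size=n_samples)
    return np.array([rng.choice(cells[int(c)]) for c in chosen_cells])

# Synthetic imbalanced embedding (assumption): one dense cluster, one sparse region.
demo_rng = np.random.default_rng(42)
dense = demo_rng.normal(0.0, 0.1, size=(9900, 2))
sparse = demo_rng.uniform(2.0, 6.0, size=(100, 2))
embedding = np.vstack([dense, sparse])

naive_idx = demo_rng.integers(0, len(embedding), size=1000)
grid_idx = grid_sample(embedding, n_samples=1000)

# Indices >= 9900 belong to the sparse region.
print("naive share of sparse patterns:", (naive_idx >= 9900).mean())
print("grid share of sparse patterns: ", (grid_idx >= 9900).mean())
```

Because every occupied cell has the same selection probability, rare patterns that occupy many thinly populated cells are drawn far more often than under naive uniform sampling over series.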

Abstract

The advent of universal time series forecasting models has revolutionized zero-shot forecasting across diverse domains, yet the critical role of data diversity in training these models remains underexplored. Existing large-scale time series datasets often suffer from inherent biases and imbalanced distributions, leading to suboptimal model performance and generalization. To address this gap, we introduce BLAST, a novel pre-training corpus designed to enhance data diversity through a balanced sampling strategy. First, BLAST incorporates 321 billion observations from publicly available datasets and employs a comprehensive suite of statistical metrics to characterize time series patterns. Then, to facilitate pattern-oriented sampling, the data is implicitly clustered using grid-based partitioning. Furthermore, by integrating grid sampling and grid mixup techniques, BLAST ensures a balanced and representative coverage of diverse patterns. Experimental results demonstrate that models pre-trained on BLAST achieve state-of-the-art performance with a fraction of the computational resources and training tokens required by existing methods. Our findings highlight the pivotal role of data diversity in improving both training efficiency and model performance for the universal forecasting task.
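
As a rough illustration of the mixup idea, the sketch below combines several sampled series into one new training series. It assumes that mixing is a convex combination of z-normalized series with weights drawn from a Dirichlet distribution; the function name `mixup_series`, the concentration parameter `alpha`, and the toy series are hypothetical, and the paper's exact procedure may differ.

```python
import numpy as np

def mixup_series(series_batch, alpha=1.5, seed=0):
    """Convex combination of z-normalized series with Dirichlet weights."""
    rng = np.random.default_rng(seed)
    x = np.asarray(series_batch, dtype=float)  # shape (k, length)
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    std = np.where(std > 0, std, 1.0)  # constant series stay at zero
    z = (x - mean) / std
    # Dirichlet weights are non-negative and sum to 1.
    weights = rng.dirichlet(np.full(len(z), alpha))
    return weights @ z, weights

# Toy series standing in for samples drawn from different grid cells (assumption).
t = np.arange(256)
noise_rng = np.random.default_rng(1)
batch = np.stack([
    np.sin(2 * np.pi * t / 24),          # seasonal pattern
    0.01 * t,                            # linear trend
    noise_rng.normal(size=t.size),       # white noise
])

mixed, weights = mixup_series(batch)
print("weights:", weights)
print("mixed series shape:", mixed.shape)
```

Normalizing before mixing keeps one large-amplitude series from dominating the combination, so the weights alone control how much each pattern contributes.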

Paper Structure

This paper contains 40 sections, 16 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Illustration of the large-scale time series forecasting pre-training dataset and various sampling methods.
  • Figure 2: The uneven distribution of the raw large-scale time series dataset collected by BLAST.
  • Figure 3: Pipeline for the balanced sampling: (i) constructing large-scale time series datasets, (ii) utilizing diverse metrics to comprehensively characterize time series, (iii) generating unified feature vectors and performing dimension reduction to visualize data imbalances, and (iv) implementing grid sampling and grid mixup to enhance the diversity of the training data.
  • Figure 4: Distribution of the raw dataset across key metrics.
  • Figure 5: Comparison of convergence speeds for different sampling methods.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3