Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation
Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang, Jing Liu
TL;DR
The paper tackles data scarcity and distribution imbalance in time-series foundation models by introducing Series-Symbol (S2), a dual-modality data generator that creates large-scale time-series data paired with symbolic expressions $Y=f(X)$. It trains SymTime, a Transformer-based time-series encoder plus a DistilBERT-based symbol encoder, using mask-based pretraining (MTM/MLM) and cross-modal contrastive learning with momentum distillation on a synthetic S2 dataset containing $25\mathrm{M}$ series-symbol pairs. Empirically, SymTime achieves competitive or state-of-the-art results across five TSA tasks (long- and short-term forecasting, classification, imputation, anomaly detection) while using a smaller model footprint than comparable foundation models pre-trained on real data. The work demonstrates that unrestricted, symbolically grounded synthetic data can substantially improve generalization and reduce reliance on large real-world labeled datasets in time-series analysis.
Abstract
Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as data scarcity and data imbalance continue to hinder their development. To address these challenges, we consider modeling complex systems through symbolic expressions that serve as semantic descriptors of time series. Building on this concept, we introduce a series-symbol (S2) dual-modality data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic representations. Leveraging the S2 dataset, we develop SymTime, a pre-trained foundation model for TSA. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of dual-modality data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance.
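
To make the series-symbol idea concrete, the following is a minimal sketch of how one pair $(X, Y=f(X))$ together with its symbolic string might be produced. It is not the paper's S2 generator: the operator sets `UNARY` and `BINARY`, the functions `sample_expression` and `generate_pair`, the expression depth, and the random-walk input series are all assumptions chosen for illustration.

```python
import math
import random

# Hypothetical operator vocabulary (an assumption, not the paper's operator set).
UNARY = {"sin": math.sin, "cos": math.cos, "tanh": math.tanh}
BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def sample_expression(rng, depth=2):
    """Sample a random expression; return (symbol string, callable f(x))."""
    if depth == 0 or rng.random() < 0.3:
        # Leaf node: the input variable x or a numeric constant.
        if rng.random() < 0.7:
            return "x", lambda x: x
        c = round(rng.uniform(-2.0, 2.0), 2)
        return str(c), lambda x, c=c: c
    if rng.random() < 0.5:
        # Unary node: apply one function to a sampled sub-expression.
        name = rng.choice(sorted(UNARY))
        sub_str, sub_fn = sample_expression(rng, depth - 1)
        op = UNARY[name]
        return f"{name}({sub_str})", lambda x, f=sub_fn, op=op: op(f(x))
    # Binary node: combine two sampled sub-expressions.
    name = rng.choice(sorted(BINARY))
    l_str, l_fn = sample_expression(rng, depth - 1)
    r_str, r_fn = sample_expression(rng, depth - 1)
    op = BINARY[name]
    return (
        f"({l_str} {name} {r_str})",
        lambda x, lf=l_fn, rf=r_fn, op=op: op(lf(x), rf(x)),
    )


def generate_pair(seed, length=64):
    """Return one (symbol, X, Y) triple with Y[t] = f(X[t])."""
    rng = random.Random(seed)
    symbol, f = sample_expression(rng)
    # Input series X: a Gaussian random walk (an illustrative assumption).
    x, xs = 0.0, []
    for _ in range(length):
        x += rng.gauss(0.0, 0.1)
        xs.append(x)
    ys = [f(v) for v in xs]
    return symbol, xs, ys


if __name__ == "__main__":
    symbol, xs, ys = generate_pair(seed=0)
    print("symbol:", symbol)
    print("first outputs:", [round(v, 3) for v in ys[:3]])
```

Because each pair depends only on a seed, such a generator can in principle emit as many pairs as needed, which is the property the abstract refers to as unrestricted creation. The symbol string would feed the symbol encoder and the series $Y$ the time-series encoder.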
