Synthetic Series-Symbol Data Generation for Time Series Foundation Models
Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang
TL;DR
This work tackles data scarcity and imbalance in time series foundation modeling by introducing a dual-modality series-symbol ($S^2$) data generation pipeline grounded in Takens' theorem and symbolic dynamics. It then presents SymTime, a pre-trained model that jointly learns time series representations and symbolic semantics through masked modeling and cross-modal contrastive learning, aided by momentum distillation. Experiments across five time series analysis (TSA) tasks show that SymTime, pre-trained on $S^2$ data, performs competitively with or better than foundation models trained on real data, with favorable efficiency, and that its performance improves as the synthetic dataset grows. The approach also shows strong alignment between the learned representations of temporal patterns and symbolic expressions, offering a scalable way to mitigate data scarcity in TSA and to generalize across domains.
Abstract
Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamical systems theory, we design a series-symbol data generation mechanism that enables the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage these strongly correlated series-symbol pairs, we develop SymTime, a pre-trained foundation model that uses symbolic information to enhance time series representation. When fine-tuned on downstream tasks, SymTime demonstrates competitive performance across five major TSA tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pre-training mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.
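To make the idea of a series-symbol pair concrete, the following is a minimal toy sketch, not the authors' $S^2$ pipeline. It assumes that a pair can be illustrated by sampling a small random symbolic expression and evaluating it pointwise on a random input sequence. The operator set, the AR(1) input process, and all function names (`sample_expression`, `sample_input`, `generate_pair`) are hypothetical choices made only for this illustration.

```python
import math
import random

# Hypothetical operator vocabulary chosen for this toy example.
UNARY = {"sin": math.sin, "cos": math.cos, "tanh": math.tanh}
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def sample_expression(rng, depth=2):
    """Sample a random expression tree, represented as nested tuples."""
    if depth == 0 or rng.random() < 0.3:
        # Leaf node: either the input variable x or a constant.
        if rng.random() < 0.7:
            return ("x",)
        return ("const", round(rng.uniform(-2.0, 2.0), 2))
    if rng.random() < 0.5:
        name = rng.choice(sorted(UNARY))
        return (name, sample_expression(rng, depth - 1))
    op = rng.choice(sorted(BINARY))
    return (op, sample_expression(rng, depth - 1), sample_expression(rng, depth - 1))


def to_string(node):
    """Render the expression tree as a readable symbolic string."""
    head = node[0]
    if head == "x":
        return "x"
    if head == "const":
        return str(node[1])
    if head in UNARY:
        return f"{head}({to_string(node[1])})"
    return f"({to_string(node[1])} {head} {to_string(node[2])})"


def evaluate(node, x):
    """Evaluate the expression tree at a single input value x."""
    head = node[0]
    if head == "x":
        return x
    if head == "const":
        return node[1]
    if head in UNARY:
        return UNARY[head](evaluate(node[1], x))
    return BINARY[head](evaluate(node[1], x), evaluate(node[2], x))


def sample_input(rng, length, phi=0.8):
    """Assumed input process for this sketch: AR(1) driven by Gaussian noise."""
    x, out = 0.0, []
    for _ in range(length):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def generate_pair(seed, length=64):
    """Return one (symbolic expression string, time series) pair."""
    rng = random.Random(seed)
    expr = sample_expression(rng)
    while "x" not in to_string(expr):  # reject constant-only expressions
        expr = sample_expression(rng)
    xs = sample_input(rng, length)
    return to_string(expr), [evaluate(expr, x) for x in xs]


if __name__ == "__main__":
    symbol, series = generate_pair(seed=0)
    print(symbol, len(series))
```

Because each pair is fully determined by its seed, a generator of this kind can produce as many pairs as needed, which is the property the abstract refers to as unrestricted creation. A real pipeline would need a richer operator set and input sampling scheme than this sketch assumes.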
