Synthetic Series-Symbol Data Generation for Time Series Foundation Models

Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang

TL;DR

This work tackles data scarcity and imbalance in time series foundation modeling by introducing a dual-modality series-symbol ($S^2$) data generation pipeline grounded in Takens' theorem and symbolic dynamics. It then presents SymTime, a pre-trained model that jointly learns time series representations and symbolic semantics through masked modeling and cross-modal contrastive learning, aided by momentum distillation. Experiments across five TSA tasks show that SymTime, pre-trained on $S^2$, achieves competitive or superior performance, with favorable efficiency, compared to foundation models trained on real data, and that its performance scales favorably with larger synthetic datasets. The approach demonstrates strong representation alignment between temporal patterns and symbolic expressions, offering a scalable path toward mitigating data scarcity in TSA and enabling broader generalization across domains.

Abstract

Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.

Paper Structure

This paper contains 123 sections, 16 equations, 25 figures, 35 tables.

Figures (25)

  • Figure 1: The connection between time series and symbolic expressions, taking the Lorenz system as an example [kuznetsov2020lorenz].
  • Figure 2: $S^2$ dataset generation mechanism (left) and SymTime network architecture (right).
  • Figure 3: The process of building a binary tree when sampling symbolic expressions. (a) tree construction; (b) variable assignment to leaf nodes; (c) unary operator insertion.
  • Figure 4: Radviz visualization of $S^2$ and Monash datasets.
  • Figure 5: Model performance comparison with state-of-the-art models on five tasks (left). Complexity analysis on a long-term time series forecasting task (ETTh1 dataset, forecast horizon 720, look-back window 96) (right). Note that because the original backbone of Time-LLM [Time-LLM] has too many parameters, we replaced it with GPT-2 [GPT-2].
  • ...and 20 more figures
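
The three steps named in the Figure 3 caption (tree construction, variable assignment to leaf nodes, unary operator insertion) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the operator sets, the sampling probabilities, the helper names (`build_tree`, `assign_leaves`, `insert_unary`, `sample_pair`) and the sinusoidal input signals are all assumptions chosen only to show how one symbolic expression can yield one paired time series.

```python
import math
import random

# Hypothetical operator vocabularies (assumed for illustration only).
BINARY_OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
UNARY_OPS = {"sin": math.sin, "cos": math.cos, "tanh": math.tanh}


def build_tree(n_binary, rng):
    """Step (a): grow a random binary tree with n_binary operator nodes."""
    if n_binary == 0:
        return {"kind": "leaf", "name": None}
    left_count = rng.randint(0, n_binary - 1)
    return {
        "kind": "binary",
        "op": rng.choice(sorted(BINARY_OPS)),
        "left": build_tree(left_count, rng),
        "right": build_tree(n_binary - 1 - left_count, rng),
    }


def assign_leaves(node, variables, rng):
    """Step (b): assign an input variable to every leaf node."""
    if node["kind"] == "leaf":
        node["name"] = rng.choice(variables)
    else:
        assign_leaves(node["left"], variables, rng)
        assign_leaves(node["right"], variables, rng)


def insert_unary(node, p, rng):
    """Step (c): wrap each node in a unary operator with probability p."""
    if node["kind"] == "binary":
        node["left"] = insert_unary(node["left"], p, rng)
        node["right"] = insert_unary(node["right"], p, rng)
    if rng.random() < p:
        return {"kind": "unary", "op": rng.choice(sorted(UNARY_OPS)), "child": node}
    return node


def to_string(node):
    """Render the tree as a symbolic expression string."""
    if node["kind"] == "leaf":
        return node["name"]
    if node["kind"] == "unary":
        return f"{node['op']}({to_string(node['child'])})"
    return f"({to_string(node['left'])} {node['op']} {to_string(node['right'])})"


def evaluate(node, env):
    """Evaluate the tree for one time step, given variable values in env."""
    if node["kind"] == "leaf":
        return env[node["name"]]
    if node["kind"] == "unary":
        return UNARY_OPS[node["op"]](evaluate(node["child"], env))
    return BINARY_OPS[node["op"]](evaluate(node["left"], env), evaluate(node["right"], env))


def sample_pair(length=64, n_binary=3, p_unary=0.3, seed=0):
    """Return one (symbolic expression, time series) pair."""
    rng = random.Random(seed)
    tree = build_tree(n_binary, rng)
    assign_leaves(tree, ["x1", "x2"], rng)
    tree = insert_unary(tree, p_unary, rng)
    series = []
    for t in range(length):
        # Assumed input signals: two fixed sinusoids stand in for sampled inputs.
        env = {"x1": math.sin(0.1 * t), "x2": math.cos(0.05 * t)}
        series.append(evaluate(tree, env))
    return to_string(tree), series


if __name__ == "__main__":
    expression, values = sample_pair(seed=1)
    print(expression)
    print(values[:5])
```

Because the series is produced by evaluating the sampled expression directly, each pair is aligned by construction, which is the property the abstract refers to as series-symbol pairs with strong correlations.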