Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models

Xu Liu, Taha Aksu, Juncheng Liu, Qingsong Wen, Yuxuan Liang, Caiming Xiong, Silvio Savarese, Doyen Sahoo, Junnan Li, Chenghao Liu

TL;DR

The paper surveys synthetic data as a scalable remedy for data scarcity, regulatory constraints, and quality biases in time series foundation models (TSFMs) and time series language models (TSLLMs). It categorizes generation methods and their usage across the model lifecycle, detailing four TSFM generation approaches (ForecastPFN, TimesFM, Chronos, Moment) and diverse TSLLM text-generation pipelines (template-based, LLM-based, web-crawled), along with mixtures of real and synthetic data. It highlights current gains and limitations, the need for systematic data-integration frameworks and data-driven generative methods, and opportunities to fine-tune with synthetic data, as well as broader future directions such as human-in-the-loop lifecycles and self-improvement. The findings offer a structured roadmap for creating high-quality, diverse synthetic datasets that enable robust, generalizable multimodal time series reasoning and forecasting in practical applications.

Abstract

Time series analysis is crucial for understanding the dynamics of complex systems. Recent advances in foundation models have led to task-agnostic Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs), enabling generalized learning and integrating contextual information. However, their success depends on large, diverse, and high-quality datasets, which are challenging to build due to regulatory, diversity, quality, and quantity constraints. Synthetic data emerge as a viable solution, addressing these challenges by offering scalable, unbiased, and high-quality alternatives. This survey provides a comprehensive review of synthetic data for TSFMs and TSLLMs, analyzing data generation strategies, their role in model pretraining, fine-tuning, and evaluation, and identifying future research directions.
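To make the idea of paired synthetic data concrete, here is a minimal sketch of one possible generator: a series built from a linear trend, a sinusoidal seasonal component, and Gaussian noise, together with a template-based text description derived from the known generating parameters. The function names (`synth_series`, `describe`) and all parameters are hypothetical and chosen for illustration only; this is not the procedure of any model or pipeline covered by the survey.

```python
import math
import random


def synth_series(n, slope, period, amplitude, noise_std, seed=0):
    """Return n points of: linear trend + sinusoidal seasonality + Gaussian noise.

    Illustrative only: the decomposition and parameter names are assumptions,
    not taken from the surveyed papers.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    return [
        slope * t
        + amplitude * math.sin(2 * math.pi * t / period)
        + rng.gauss(0.0, noise_std)
        for t in range(n)
    ]


def describe(slope, period):
    """Fill a fixed text template from the parameters that generated the series."""
    if slope > 0:
        direction = "upward"
    elif slope < 0:
        direction = "downward"
    else:
        direction = "flat"
    return f"Trend: {direction}; seasonal period: {period} steps."


# One synthetic (series, text) pair; the caption is correct by construction
# because it is derived from the generating parameters, not from the data.
series = synth_series(n=48, slope=0.5, period=12, amplitude=2.0, noise_std=0.1)
caption = describe(slope=0.5, period=12)
```

Because the text is computed from the generator's parameters rather than inferred from the values, the pair needs no manual labeling, which is the main appeal of a template-based approach in a sketch like this one.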

Paper Structure

This paper contains 39 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Two taxonomies for analyzing synthetic data in TSLLMs: the top figure categorizes by text generation method, while the bottom one categorizes by the composition of synthetic and real data.
  • Figure 2: Illustration of the usage of synthetic data across pretraining, finetuning, and evaluation stages in TSLLMs.