TimeDiT: General-purpose Diffusion Transformers for Time Series Foundation Model

Defu Cao, Wen Ye, Yizhou Zhang, Yan Liu

TL;DR

The paper tackles the need for a generalizable foundation model for time series that can handle missing data, multi-resolution sampling, and uncertainty. It proposes TimeDiT, a diffusion-transformer framework with a unified masking scheme and physics-informed sampling that injects PDE priors during inference. TimeDiT demonstrates strong zero-shot and fine-tuned performance across forecasting, imputation, anomaly detection, and data generation, with notable gains in uncertainty quantification and domain knowledge integration. By serving as a proto-foundation model, TimeDiT bridges the gap between universal temporal modeling and domain-specific needs, offering efficient sampling and flexible integration of external knowledge.

Abstract

Foundation models, particularly Large Language Models (LLMs), have revolutionized text and video processing, yet time series data presents distinct challenges for such approaches due to domain-specific features such as missing values and multi-resolution characteristics. Furthermore, the de facto autoregressive transformers tend to learn deterministic temporal dependencies within pre-training data while overlooking inherent uncertainties and lacking integration of physical constraints. In this paper, we introduce TimeDiT, a diffusion transformer model that synergistically combines transformer-based temporal dependency learning with diffusion-based probabilistic sampling. TimeDiT employs a unified masking mechanism to harmonize the training and inference process across diverse tasks while introducing a theoretically grounded, finetuning-free model editing strategy that enables flexible integration of external knowledge during sampling. Acknowledging the challenges of unifying multiple downstream tasks under a single model, our systematic evaluation demonstrates TimeDiT's effectiveness both in fundamental tasks, i.e., forecasting and imputation, through zero-shot/fine-tuning; and in domain tasks, i.e., multi-resolution forecasting, anomaly detection, and data generation, establishing it as a proto-foundation model that bridges the gap between general-purpose and domain-specific models.

Paper Structure

This paper contains 67 sections, 1 theorem, 42 equations, 11 figures, 22 tables, 1 algorithm.

Key Result

Theorem 3.1

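The optimal $q(\mathbf{x}^\text{tar}|\mathbf{x}^\text{con})$ in Eq.eqx is the Boltzmann distribution defined on the following energy function:

$$E(\mathbf{x}^\text{tar};\mathbf{x}^\text{con}) = K(\mathbf{x}^\text{tar};F)+\alpha\log p(\mathbf{x}^\text{tar}|\mathbf{x}^\text{con}),$$

in other words, the optimal $q(\mathbf{x}^\text{tar}|\mathbf{x}^\text{con})$ is:

$$q(\mathbf{x}^\text{tar}|\mathbf{x}^\text{con}) = \frac{1}{Z}\exp\left(K(\mathbf{x}^\text{tar};F)+\alpha\log p(\mathbf{x}^\text{tar}|\mathbf{x}^\text{con})\right),$$

where $Z = \int \exp(K(\mathbf{x}^\text{tar};F)+\alpha\log p(\mathbf{x}^\text{tar}|\mathbf{x}^\text{con}))d\mathbf{x}^\text{tar}$ is the partition function.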
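The partition function $Z$ in Theorem 3.1 implies that the optimal distribution is proportional to $\exp(K + \alpha\log p)$, i.e. the base model density raised to the power $\alpha$ and reweighted by the knowledge term. The following is a minimal numerical sketch of that reweighting on a discretized one-dimensional target, not the paper's sampling procedure. The grid, the standard-normal stand-in for $p$, the quadratic stand-in for $K$, and the function name `boltzmann_edit` are all assumptions chosen for illustration.

```python
import math

def boltzmann_edit(log_p, k, alpha):
    """Return q_i = exp(k_i + alpha * log_p_i) / Z on a discrete grid.

    log_p : log-density values of the base model p (may be unnormalized)
    k     : values of the knowledge term K at the same grid points
    alpha : weight on the base model's log-density
    """
    energy = [ki + alpha * lpi for ki, lpi in zip(k, log_p)]
    shift = max(energy)  # subtract the max before exp for numerical stability
    weights = [math.exp(e - shift) for e in energy]
    z = sum(weights)     # discrete analogue of the partition function Z
    return [w / z for w in weights]

# Hypothetical 1-D setup: grid over [-5, 5] in steps of 0.1.
grid = [i / 10 for i in range(-50, 51)]

# Assumed base model p: standard normal (log-density up to a constant).
log_p = [-0.5 * x * x for x in grid]

# Assumed knowledge term K: rewards targets close to 2.0.
k = [-((x - 2.0) ** 2) for x in grid]

q = boltzmann_edit(log_p, k, alpha=1.0)
mode = grid[q.index(max(q))]

# With K set to zero the edited distribution reduces to normalized p.
q_base = boltzmann_edit(log_p, [0.0] * len(grid), alpha=1.0)
base_mode = grid[q_base.index(max(q_base))]

print("mode of base model:", base_mode)
print("mode after editing:", mode)
```

In this toy setting the mode of the edited distribution moves from the base model's mode at 0 toward the value favored by the knowledge term, while the distribution stays normalized. A larger $\alpha$ would keep the result closer to the base model.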

Figures (11)

  • Figure 1: TimeDiT Architecture. Left: TimeDiT framework with diverse multivariate time series from different domains with multi-resolution or missing values; Middle: Structure of TimeDiT block; Right top: Illustration of masks generated by Time Series Mask Unit; Right bottom: Masks for downstream tasks that TimeDiT handles during inference.
  • Figure 2: TimeDiT with textual information.
  • Figure 3: Visualization of missing-value (a) and multi-resolution (b) forecasting results on the Exchange dataset, and missing-value (c) and multi-resolution (d) forecasting results on the Traffic dataset, comparing TimeDiT with state-of-the-art diffusion-based methods. The x-axis value in (b) is the sampling skip of the resolutions in the multivariate input.
  • Figure 4: Visualization of missing-value (a) and multi-resolution (b) forecasting results on the Traffic (PEMS-SF) dataset, comparing TimeDiT with state-of-the-art diffusion-based methods. The x-axis value in (b) is the sampling skip of the resolutions in the multivariate input.
  • Figure 5: Visualization of imputation task on ETT datasets. This figure illustrates TimeDiT's performance, with red $\times$'s marking observed values, blue dots showing ground truth points for interpolation, a green line representing TimeDiT's mean of interpolation, and green shading indicating its estimated uncertainty intervals.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Theorem 3.1
  • proof