DiTS: Multimodal Diffusion Transformers Are Time Series Forecasters

Haoran Zhang, Haixuan Liu, Yong Liu, Yunzhong Qiu, Yuxuan Wang, Jianmin Wang, Mingsheng Long

TL;DR

DiTS addresses covariate-aware probabilistic forecasting for multivariate time series by recasting endogenous and exogenous variates as distinct modalities within a Multimodal Diffusion Transformer framework. It introduces a dual-stream DiTS-Block with Time Attention and Variate Attention, augmented by adaptive modulation and a flow-matching training objective to efficiently capture intra- and inter-variate dependencies without flattening 2D time-variate grids. The model achieves state-of-the-art results on covariate-aware benchmarks (e.g., FEV-Bench and EPF) for both probabilistic and deterministic forecasting, with extensive ablations validating the benefits of the MM-DiT-inspired conditioning and attention design. Overall, DiTS provides a robust, scalable backbone that bridges diffusion-based generative modeling with high-dimensional time-series forecasting, paving the way for general-purpose Time Series Foundation Models capable of large-scale generative forecasting across domains.
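The flow-matching objective mentioned above can be illustrated with a minimal NumPy sketch. It assumes the common straight-line (rectified-flow) path between a Gaussian noise sample and the clean forecast window. The summary does not state the exact path or time sampling DiTS uses, and the names `flow_matching_pair`, `flow_matching_loss`, and `euler_sample` are hypothetical.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one training example for a velocity model.

    x1: clean future window, shape (batch, horizon).
    Assumption: linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I).
    Returns the noisy point x_t, the time t, and the regression target.
    """
    x0 = rng.standard_normal(x1.shape)        # pure-noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))    # one time value per series
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight path
    v_target = x1 - x0                        # velocity is constant along the path
    return x_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, shape, steps, rng):
    """Draw a sample by integrating dx/dt = v(x, t) from t=0 to t=1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full((shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

In a real model, `velocity_fn` would be the network conditioned on the lookback window and the covariates. Running `euler_sample` several times with different noise draws yields the sample paths from which forecast quantiles are estimated.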

Abstract

While generative modeling on time series facilitates more capable and flexible probabilistic forecasting, existing generative time series models do not address the multi-dimensional properties of time series data well. The prevalent architecture of Diffusion Transformers (DiT), which relies on simplistic conditioning controls and a single-stream Transformer backbone, tends to underutilize cross-variate dependencies in covariate-aware forecasting. Inspired by Multimodal Diffusion Transformers that integrate textual guidance into video generation, we propose Diffusion Transformers for Time Series (DiTS), a general-purpose architecture that frames endogenous and exogenous variates as distinct modalities. To better capture both inter-variate and intra-variate dependencies, we design a dual-stream Transformer block tailored for time-series data, comprising a Time Attention module for autoregressive modeling along the temporal dimension and a Variate Attention module for cross-variate modeling. Unlike the common approach for images, which flattens 2D token grids into 1D sequences, our design leverages the low-rank property inherent in multivariate dependencies, thereby reducing computational costs. Experiments show that DiTS achieves state-of-the-art performance across benchmarks, regardless of the presence of future exogenous variate observations, demonstrating unique generative forecasting strengths over traditional deterministic deep forecasting models.
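To make the contrast with flattening concrete, the sketch below applies attention separately along the time axis and the variate axis of a token grid of shape `(V, T, D)`. This is a simplified illustration, not the DiTS-Block itself: it omits the learned query/key/value projections, the modulation, and the dual endogenous/exogenous streams, and the causal mask for Time Attention is an assumption drawn from the word "autoregressive" in the abstract. All function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention over the second-to-last axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # block masked positions
    return softmax(scores) @ v

def time_attention(h):
    """h: (V, T, D). Each variate attends causally over its own time tokens."""
    T = h.shape[1]
    causal = np.tril(np.ones((T, T), dtype=bool))
    return attention(h, h, h, mask=causal)

def variate_attention(h):
    """h: (V, T, D). At each time index, tokens attend across variates."""
    ht = np.swapaxes(h, 0, 1)                 # (T, V, D)
    return np.swapaxes(attention(ht, ht, ht), 0, 1)

def score_counts(V, T):
    """Number of attention scores: flattened grid vs. factorized axes."""
    flattened = (V * T) ** 2
    factorized = V * T * T + T * V * V
    return flattened, factorized
```

For example, with `V = 7` variates and `T = 96` time tokens, `score_counts` returns 451,584 scores for the flattened sequence against 69,216 for the factorized form, which is the kind of saving the abstract attributes to avoiding 1D flattening.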

Paper Structure

This paper contains 44 sections, 11 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Structural homogeneity between multimodal generation and multivariate forecasting. DiTS treats variates as distinct, high-fidelity streams in the style of MM-DiT (Esser et al., 2024), which lets the model exploit modality-level feature discrepancies while maintaining conditional control through fine-grained interaction.
  • Figure 2: Structural comparison of multivariate architectures. Existing Transformer-based forecasters typically adopt either channel-independent (Nie et al., 2022) or asymmetric (Wang et al., 2024) modeling strategies, while exhaustive modeling across all inter-variate patches (Liu et al., 2024) achieves fine granularity at the cost of excessive complexity. DiTS employs a dual-stream design that treats the time and variate dimensions as orthogonal axes, enabling efficient yet fine-grained interaction akin to multimodal generation.
  • Figure 3: Overall architecture of DiTS. The endogenous and exogenous variates are treated as the modalities $x$ and $c$ of the MM-DiT input and interact under the modulation signal $y$; the output stream of $x$ is finally transformed into a denoising velocity field.
  • Figure 4: Schematic illustration of the DiTS Block. Here, $T$, $V$, and $F$ denote the modulation parameters for the time attention, variate attention, and FFN modules, respectively. Given the heterogeneity of the dual streams $x$ and $c$, we employ shared time attention and FFN to model inter-token dependencies and intra-token information interaction. Conversely, joint attention facilitates variate-level interaction, enabling the denoising process of the endogenous variate window to be guided by covariate-based conditional control.
  • Figure 5: Probabilistic forecasting performance comparison on the FEV leaderboard subset. We report the average Weighted Quantile Loss (WQL) and Mean Absolute Scaled Error (MASE) against a range of competitive baselines. DiTS consistently achieves the lowest error on both metrics, demonstrating its robust probabilistic modeling capability. Full results can be found in the table labeled tab:fev_full.
  • ...and 10 more figures
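
The two metrics named in the Figure 5 caption can be computed as in the sketch below. It follows the widely used definitions (pinball loss summed over the horizon and normalized by the sum of absolute targets for WQL; forecast MAE scaled by the in-sample seasonal-naive MAE for MASE). How the benchmark aggregates scores across datasets is not stated in this section, so that step is left out, and the function names are hypothetical.

```python
import numpy as np

def weighted_quantile_loss(y, q_preds, levels):
    """WQL for one series.

    y: (H,) targets; q_preds: (Q, H) quantile forecasts; levels: (Q,) in (0, 1).
    """
    y = np.asarray(y, dtype=float)
    q_preds = np.asarray(q_preds, dtype=float)
    levels = np.asarray(levels, dtype=float)[:, None]
    diff = y[None, :] - q_preds
    # Pinball loss: under-prediction weighted by q, over-prediction by (1 - q).
    pinball = np.maximum(levels * diff, (levels - 1.0) * diff)
    per_level = 2.0 * pinball.sum(axis=1) / np.abs(y).sum()
    return float(per_level.mean())

def mase(y, y_pred, history, season=1):
    """MAE of the forecast divided by the MAE of a seasonal-naive forecast
    evaluated on the in-sample history."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    history = np.asarray(history, dtype=float)
    scale = np.mean(np.abs(history[season:] - history[:-season]))
    return float(np.mean(np.abs(y - y_pred)) / scale)
```

With a single quantile level of 0.5, WQL reduces to the sum of absolute errors divided by the sum of absolute targets, which is a convenient sanity check when wiring these metrics into an evaluation loop.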