DiTS: Multimodal Diffusion Transformers Are Time Series Forecasters
Haoran Zhang, Haixuan Liu, Yong Liu, Yunzhong Qiu, Yuxuan Wang, Jianmin Wang, Mingsheng Long
TL;DR
DiTS addresses covariate-aware probabilistic forecasting for multivariate time series by treating endogenous and exogenous variates as distinct modalities within a Multimodal Diffusion Transformer (MM-DiT) framework. It introduces a dual-stream DiTS-Block that combines Time Attention and Variate Attention with adaptive modulation, and it is trained with a flow-matching objective. This design captures intra- and inter-variate dependencies efficiently without flattening the 2D time-variate grid. The model achieves state-of-the-art results on covariate-aware benchmarks (e.g., FEV-Bench and EPF) for both probabilistic and deterministic forecasting, and extensive ablations validate the benefits of the MM-DiT-inspired conditioning and attention design. Overall, DiTS offers a robust, scalable backbone that bridges diffusion-based generative modeling and high-dimensional time-series forecasting, a step toward general-purpose Time Series Foundation Models for large-scale generative forecasting across domains.
Abstract
While generative modeling on time series facilitates more capable and flexible probabilistic forecasting, existing generative time series models do not address the multi-dimensional properties of time series data well. The prevalent architecture of Diffusion Transformers (DiT), which relies on simplistic conditioning controls and a single-stream Transformer backbone, tends to underutilize cross-variate dependencies in covariate-aware forecasting. Inspired by Multimodal Diffusion Transformers that integrate textual guidance into video generation, we propose Diffusion Transformers for Time Series (DiTS), a general-purpose architecture that frames endogenous and exogenous variates as distinct modalities. To better capture both inter-variate and intra-variate dependencies, we design a dual-stream Transformer block tailored for time-series data, comprising a Time Attention module for autoregressive modeling along the temporal dimension and a Variate Attention module for cross-variate modeling. Unlike the common approach for images, which flattens 2D token grids into 1D sequences, our design leverages the low-rank property inherent in multivariate dependencies, thereby reducing computational costs. Experiments show that DiTS achieves state-of-the-art performance across benchmarks, regardless of the presence of future exogenous variate observations, demonstrating unique generative forecasting strengths over traditional deterministic deep forecasting models.
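The contrast between flattening a 2D time-variate grid and attending along each axis separately can be made concrete with a small NumPy sketch. This is an illustration under simplifying assumptions, not the authors' implementation: it omits learned query/key/value projections, the dual-stream split between endogenous and exogenous variates, adaptive modulation, and the diffusion objective. The function names (`time_attention`, `variate_attention`, `factorized_block`, `score_entries`) are hypothetical and chosen only for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v, mask=None):
    # Scaled dot-product attention over the second-to-last axis.
    # q, k, v: (..., n, d); mask: boolean (n, n), True where attention is allowed.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v


def time_attention(x):
    # x: (V, T, D). Each variate attends over its own T time steps.
    # A causal (lower-triangular) mask is assumed here to mirror the
    # "autoregressive modeling along the temporal dimension" in the abstract.
    T = x.shape[1]
    causal = np.tril(np.ones((T, T), dtype=bool))
    return attention(x, x, x, mask=causal)


def variate_attention(x):
    # x: (V, T, D). At each time step, the V variates attend to one another.
    xt = np.swapaxes(x, 0, 1)            # (T, V, D)
    out = attention(xt, xt, xt)          # attention over the variate axis
    return np.swapaxes(out, 0, 1)        # back to (V, T, D)


def factorized_block(x):
    # Residual time attention followed by residual variate attention.
    x = x + time_attention(x)
    x = x + variate_attention(x)
    return x


def score_entries(V, T):
    # Number of attention-score entries for each strategy.
    flattened = (V * T) ** 2             # one sequence of length V*T
    factorized = V * T * T + T * V * V   # V sequences of length T, plus T of length V
    return flattened, factorized
```

For a grid with `V = 3` variates and `T = 8` time steps, `score_entries(3, 8)` gives 576 score entries for the flattened sequence against 264 for the factorized layout, and the gap widens as either dimension grows. The sketch only shows where the saving comes from; how DiTS parameterizes the two attention modules is described in the paper itself.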
