Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting

Kashif Rasul, Calvin Seward, Ingmar Schuster, Roland Vollgraf

TL;DR

This work introduces TimeGrad, an autoregressive denoising diffusion model for multivariate probabilistic time series forecasting that learns per-step conditional distributions by denoising diffused observations. It combines an RNN-based hidden state with a diffusion emission head and trains via a variational diffusion objective, using Langevin-like sampling to generate multiple future trajectories for uncertainty quantification. The method achieves state-of-the-art CRPS_sum on six diverse real-world datasets, demonstrating strong probabilistic forecasting performance in high-dimensional settings. The paper also discusses ablations, scaling techniques, covariate integration, and future directions for faster sampling and extensions with advanced architectures.

Abstract

In this work, we propose TimeGrad, an autoregressive model for multivariate probabilistic time series forecasting which samples from the data distribution at each time step by estimating its gradient. To this end, we use diffusion probabilistic models, a class of latent variable models closely connected to score matching and energy-based methods. Our model learns gradients by optimizing a variational bound on the data likelihood and at inference time converts white noise into a sample of the distribution of interest through a Markov chain using Langevin sampling. We demonstrate experimentally that the proposed autoregressive denoising diffusion model is the new state-of-the-art multivariate probabilistic forecasting method on real-world data sets with thousands of correlated dimensions. We hope that this method is a useful tool for practitioners and lays the foundation for future research in this area.
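
To make the abstract's two phases concrete, the following is a minimal sketch of the standard denoising diffusion recipe (closed-form forward noising, a noise-prediction loss, and an ancestral reverse chain), not the authors' implementation. Everything named here is an assumption for illustration: the linear schedule and its endpoints, the choice of reverse variance equal to beta_n, and the helper `make_point_mass_oracle`, which stands in for a trained network. In TimeGrad the noise predictor would additionally be conditioned on the RNN hidden state, which this sketch omits.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.1):
    """Assumed linear variance schedule beta_1..beta_N (endpoints are illustrative)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_n = prod_{i <= n} (1 - beta_i)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def diffuse(x0, n, alpha_bars, rng):
    """Draw x_n ~ q(x_n | x_0) in closed form (n is 1-indexed); returns (x_n, eps)."""
    a_bar = alpha_bars[n - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xn = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return xn, eps


def denoising_loss(eps_model, x0, n, alpha_bars, rng):
    """Mean squared error between the injected noise and the model's prediction."""
    xn, eps = diffuse(x0, n, alpha_bars, rng)
    pred = eps_model(xn, n)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)


def sample(eps_model, dim, betas, alpha_bars, rng):
    """Reverse chain: start from white noise and denoise from step N down to 1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for n in range(len(betas), 0, -1):
        beta, a_bar = betas[n - 1], alpha_bars[n - 1]
        pred = eps_model(x, n)
        coef = beta / math.sqrt(1.0 - a_bar)
        # Assumption: reverse variance sigma_n^2 = beta_n; no noise on the last step.
        sigma = math.sqrt(beta) if n > 1 else 0.0
        x = [
            (xi - coef * p) / math.sqrt(1.0 - beta) + sigma * rng.gauss(0.0, 1.0)
            for xi, p in zip(x, pred)
        ]
    return x


def make_point_mass_oracle(mu, alpha_bars):
    """Hypothetical stand-in for a trained network: the exact noise predictor
    when the data distribution is a point mass at mu."""
    def eps_model(x, n):
        a_bar = alpha_bars[n - 1]
        return [(xi - math.sqrt(a_bar) * m) / math.sqrt(1.0 - a_bar) for xi, m in zip(x, mu)]
    return eps_model


if __name__ == "__main__":
    rng = random.Random(0)
    betas = linear_beta_schedule(100)
    alpha_bars = cumulative_alphas(betas)
    oracle = make_point_mass_oracle([1.5, -2.0], alpha_bars)
    print(sample(oracle, 2, betas, alpha_bars, rng))
```

With the oracle predictor the reverse chain lands on the point mass, which is a convenient sanity check before swapping in a learned model. Drawing several such chains per time step is what would yield multiple forecast trajectories.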

Paper Structure

This paper contains 16 sections, 21 equations, 4 figures, 2 tables, 2 algorithms.

Figures (4)

  • Figure 1: TimeGrad schematic: an RNN-conditioned diffusion probabilistic model at some time $t-1$, depicting the fixed forward process that adds Gaussian noise and the learned reverse process.
  • Figure 2: The network architecture of $\epsilon_\theta$, consisting of $\mathtt{residual\_layers}=8$ conditional residual blocks with the Gated Activation Unit $\sigma(\cdot) \odot \tanh(\cdot)$ from NIPS2016_b1301141, whose skip-connection outputs are summed to compute the final output. Conv1x1 and Conv1d are 1D convolutional layers with filter sizes of $1$ and $3$, respectively, using circular padding so that the spatial size remains $D$; all but the last convolutional layer have $\mathtt{residual\_channels}=8$ output channels. FC are linear layers used to up/down-sample the input to the appropriate size for broadcasting.
  • Figure 3: TimeGrad test set $\mathrm{CRPS}_{\mathrm{sum}}$ for Electricity data when varying the total diffusion length $N$. Good performance is already established at $N \approx 10$, with the optimal value at $N\approx100$. The mean and standard errors are obtained over $5$ independent runs. We see similar behaviour with other data sets.
  • Figure 4: TimeGrad prediction intervals and test set ground truth for Traffic data, showing the first $6$ of $963$ dimensions from the first rolling window. Note that neighboring entities have an order of magnitude difference in scales.
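
The $\mathrm{CRPS}_{\mathrm{sum}}$ metric reported in Figure 3 can be illustrated with a small sketch. It assumes the usual reading of the metric: sum the forecast samples and the ground truth across all dimensions, score the resulting univariate forecast with CRPS at each time step, and average over the horizon. The estimator below uses the sample-based energy form $\mathrm{CRPS}(F, x) \approx \mathbb{E}|X - x| - \tfrac{1}{2}\mathbb{E}|X - X'|$, which is a swapped-in choice for illustration and may differ from the estimator and any normalization used in the paper's evaluation. The function names and the `sample_paths[s][t][d]` layout are hypothetical.

```python
def crps_from_samples(samples, observation):
    """Sample-based CRPS estimate: E|X - x| - 0.5 * E|X - X'|."""
    m = len(samples)
    term_obs = sum(abs(s - observation) for s in samples) / m
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (m * m)
    return term_obs - 0.5 * term_spread


def crps_sum(sample_paths, truth):
    """CRPS of the dimension-wise sum, averaged over the prediction horizon.

    sample_paths[s][t][d]: forecast sample s, time step t, dimension d (assumed layout).
    truth[t][d]: ground-truth value at time step t, dimension d.
    """
    horizon = len(truth)
    total = 0.0
    for t in range(horizon):
        summed_samples = [sum(path[t]) for path in sample_paths]
        total += crps_from_samples(summed_samples, sum(truth[t]))
    return total / horizon


if __name__ == "__main__":
    # Two sample paths, one time step, two dimensions.
    paths = [[[0.0, 0.0]], [[1.0, 1.0]]]
    print(crps_sum(paths, [[0.5, 0.5]]))
```

Lower values are better, and a forecast whose samples all coincide with the observation scores zero. The pairwise term makes this estimator quadratic in the number of samples, which is acceptable for the small sample counts typical of a sketch like this.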