Scalable Dynamic Mixture Model with Full Covariance for Probabilistic Traffic Forecasting

Seongjin Choi, Nicolas Saunier, Vincent Zhihao Zheng, Martin Trepanier, Lijun Sun

TL;DR

This work addresses the challenge of non-stationary, spatiotemporally correlated forecasting errors in traffic speed prediction by modeling the error distribution as a dynamic mixture of zero-mean matrix-variate Gaussians. Each component uses a Kronecker-structured covariance, Σ^k = Σ^k_Q ⊗ Σ^k_N, with time-varying weights ω_t^k = f_ω(X_t) and precision-based Cholesky parameterizations to enable scalable training. The model optimizes a hybrid loss, L_DynMix = (1−ρ)L_MSE + ρL_NLL, integrating both mean prediction and probabilistic uncertainty, and demonstrates improved RMSE, MAPE, and MAE on PEMS-BAY and METR-LA datasets, with interpretable spatiotemporal patterns across mixture components. The approach acts as an add-on to existing deep traffic models, offering a principled way to capture multimodal and dynamic error structures, and has potential extensions to non-Gaussian components, relaxed covariance structures, and graph-informed precision matrices for further performance gains.
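
The likelihood term described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`matrix_normal_logpdf`, `mixture_nll`, `dynmix_loss`) are hypothetical, the mixture weights are passed in directly rather than produced by f_ω(X_t), the MSE is averaged over all entries, the NLL is left unnormalized, and the default ρ is only a placeholder. It takes N as the number of sensors and Q as the number of forecasting horizons, and uses lower-triangular Cholesky factors of the precision matrices so that no covariance matrix is ever inverted or formed as a full Kronecker product.

```python
import numpy as np

def matrix_normal_logpdf(R, L_N, L_Q):
    """Log-density of a zero-mean matrix normal with covariance Sigma_Q kron Sigma_N.

    R   : (N, Q) residual matrix (sensors x horizons).
    L_N : (N, N) lower-triangular factor with inv(Sigma_N) = L_N @ L_N.T.
    L_Q : (Q, Q) lower-triangular factor with inv(Sigma_Q) = L_Q @ L_Q.T.
    Both factors are assumed to have strictly positive diagonals.
    """
    N, Q = R.shape
    # -0.5 * log|Sigma_Q kron Sigma_N|, written through the precision factors.
    half_logdet_precision = (Q * np.log(np.diag(L_N)).sum()
                             + N * np.log(np.diag(L_Q)).sum())
    # tr(inv(Sigma_Q) R^T inv(Sigma_N) R) equals the squared Frobenius norm of Z.
    Z = L_N.T @ R @ L_Q
    quad = np.sum(Z ** 2)
    return -0.5 * N * Q * np.log(2.0 * np.pi) + half_logdet_precision - 0.5 * quad

def mixture_nll(R, weights, L_Ns, L_Qs):
    """Negative log-likelihood of R under a K-component mixture (log-sum-exp form)."""
    log_terms = np.array([
        np.log(w) + matrix_normal_logpdf(R, L_N, L_Q)
        for w, L_N, L_Q in zip(weights, L_Ns, L_Qs)
    ])
    m = log_terms.max()  # subtract the maximum for numerical stability
    return -(m + np.log(np.exp(log_terms - m).sum()))

def dynmix_loss(Y, Y_hat, weights, L_Ns, L_Qs, rho=0.1):
    """Hybrid loss (1 - rho) * MSE + rho * NLL; rho=0.1 is an illustrative default."""
    R = Y - Y_hat
    mse = np.mean(R ** 2)
    return (1.0 - rho) * mse + rho * mixture_nll(R, weights, L_Ns, L_Qs)
```

Working with `L_N.T @ R @ L_Q` costs on the order of N²Q + NQ² operations per component, whereas a dense NQ × NQ covariance would require a factorization that is cubic in NQ; this is the sense in which the Kronecker structure makes a full covariance affordable.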

Abstract

Deep learning-based multivariate and multistep-ahead traffic forecasting models are typically trained with the mean squared error (MSE) or mean absolute error (MAE) as the loss function in a sequence-to-sequence setting, which implicitly assumes that the errors follow independent and isotropic Gaussian or Laplacian distributions. However, such assumptions are often unrealistic for real-world traffic forecasting tasks, where the distribution of spatiotemporal forecasting errors is complex, with strong concurrent correlations across both sensors and forecasting horizons that vary over time. In this paper, we model the time-varying distribution for the matrix-variate error process as a dynamic mixture of zero-mean Gaussian distributions. To achieve efficiency, flexibility, and scalability, we parameterize each mixture component using a matrix normal distribution and allow the mixture weight to change and be predictable over time. The proposed method can be seamlessly integrated into existing deep-learning frameworks with only a few additional parameters to be learned. We evaluate the performance of the proposed method on a traffic speed forecasting task and find that our method not only improves model performance but also provides interpretable spatiotemporal correlation structures.
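
The link between the MSE/MAE losses and the independence assumption can be made explicit with a short derivation. As a sketch under added assumptions (the residual matrix $R_t \in \mathbb{R}^{N \times Q}$ for N sensors and Q horizons, a common variance $\sigma^2$, and a common Laplacian scale $b$ are notation introduced here for illustration), the negative log-likelihoods of i.i.d. entries are:

```latex
% i.i.d. Gaussian entries with variance \sigma^2
-\log p(R_t) = \frac{NQ}{2}\log\left(2\pi\sigma^2\right)
             + \frac{1}{2\sigma^2}\,\lVert R_t \rVert_F^2

% i.i.d. Laplacian entries with scale b
-\log p(R_t) = NQ\,\log(2b)
             + \frac{1}{b}\sum_{n=1}^{N}\sum_{q=1}^{Q}\bigl\lvert [R_t]_{n,q} \bigr\rvert
```

With $\sigma^2$ or $b$ held fixed, the first terms are constants, so minimizing these expressions is the same as minimizing the sum of squared errors (MSE up to scaling) or the sum of absolute errors (MAE up to scaling). Neither form contains any cross term between different sensors or horizons, which is exactly the correlation structure the mixture of matrix normal components is meant to capture.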

Paper Structure

This paper contains 14 sections, 12 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Empirical results based on forecasts from Graph WaveNet (Wu et al., 2019) on PEMS-BAY data. (A) Temporal (i.e., across prediction horizons) covariance matrices of Sensor #6 at different times of day (2:00-3:00, 8:00-9:00, and 17:00-18:00). (B) Spatial (i.e., across sensors) covariance matrices of the 12-step-ahead prediction at different times of day (2:00-3:00, 8:00-9:00, and 17:00-18:00).
  • Figure 2: Model training with (a) the conventional MSE loss and (b) the proposed method.
  • Figure 3: (A) Learned spatial ($\Sigma_N^k$) and temporal ($\Sigma_Q^k$) covariance matrices with $K=3$, using GWN as the base model on the PEMS-BAY-2017-SPEED dataset. For better visualization, we divided the temporal covariance matrices by the maximum diagonal entry of the temporal covariance matrices and multiplied the spatial covariance matrices by the same value. This leaves the Kronecker product unchanged, since $A\otimes B = \left(\nu A\right) \otimes \left(\frac{1}{\nu} B\right)$ for any $\nu \neq 0$. (B) Example patterns of the mixture weight $\omega_t^k$. Across the 35 days in the testing dataset, two distinct patterns emerged, a weekday pattern and a weekend pattern; a representative case of each is shown.
  • Figure 4: Ablation on prediction performance using the proposed loss function for $K \in \{1,\dots,10\}$, with GWN as the base model on the PEMS-BAY 2017 dataset.
  • Figure 5: Ablation on different $\rho$ values
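
The rescaling described in the Figure 3 caption can be checked numerically. The sketch below is illustrative only: `normalize_kronecker_pair` is a hypothetical helper, and it assumes the normalization is applied per component, using the largest diagonal entry of that component's temporal covariance as $\nu$.

```python
import numpy as np

def normalize_kronecker_pair(Sigma_N, Sigma_Q):
    """Rescale a (spatial, temporal) covariance pair for plotting.

    Divides the temporal covariance by its largest diagonal entry and
    multiplies the spatial covariance by the same value, so that
    Sigma_Q kron Sigma_N is unchanged.
    """
    nu = np.max(np.diag(Sigma_Q))
    return Sigma_N * nu, Sigma_Q / nu

# Small example: a 2x2 spatial and a 3x3 temporal covariance.
Sigma_N = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
Sigma_Q = np.array([[4.0, 1.0, 0.0],
                    [1.0, 3.0, 0.5],
                    [0.0, 0.5, 2.0]])

Sigma_N_plot, Sigma_Q_plot = normalize_kronecker_pair(Sigma_N, Sigma_Q)

# The full covariance is identical before and after rescaling.
same_product = np.allclose(np.kron(Sigma_Q_plot, Sigma_N_plot),
                           np.kron(Sigma_Q, Sigma_N))
```

The same identity means that only the product of the two factors is determined by the likelihood, so the individual scales of the spatial and temporal matrices are a presentation choice rather than a learned quantity.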