
Statistical benchmarking of transformer models in low signal-to-noise time-series forecasting

Cyril Garcia, Guillaume Remy

TL;DR

This work tackles multivariate time-series forecasting in low-data settings by benchmarking transformer models using synthetic data with tunable temporal and cross-sectional dependencies. It introduces two-way attention transformers that alternate between temporal and cross-sectional self-attention and adds a dynamic sparsification mechanism to adapt attention across regimes. Across controlled experiments, two-way transformers outperform traditional baselines such as Lasso, boosting, and MLP in many settings, and dynamic sparsity yields substantial gains in noisy environments while preserving performance when dependencies are strong. The study provides mechanistic insight by analyzing learned attention patterns, connects to sparsity concepts in classical regression, and offers reproducible benchmarks and code for future research.

Abstract

We study the performance of transformer architectures for multivariate time-series forecasting in low-data regimes consisting of only a few years of daily observations. Using synthetically generated processes with known temporal and cross-sectional dependency structures and varying signal-to-noise ratios, we conduct bootstrapped experiments that enable direct evaluation via out-of-sample correlations with the optimal ground-truth predictor. We show that two-way attention transformers, which alternate between temporal and cross-sectional self-attention, can outperform standard baselines (Lasso, boosting methods, and fully connected multilayer perceptrons) across a wide range of settings, including low signal-to-noise regimes. We further introduce a dynamic sparsification procedure for attention matrices applied during training, and demonstrate that it is particularly effective in noisy environments, where the correlation between the target variable and the optimal predictor is on the order of a few percent. Analysis of the learned attention patterns reveals interpretable structure and suggests connections to sparsity-inducing regularization in classical regression, providing insight into why these models generalize effectively under noise.
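To make the idea of alternating temporal and cross-sectional self-attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. Several choices are assumptions made purely for illustration: the function names (`sparse_self_attention`, `two_way_block`), the identity query/key/value projections, the residual connections, the causal mask on the temporal pass, and the use of a fixed top-k rule as a stand-in for the dynamic sparsification procedure, whose actual form is not specified in this abstract.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive zero weight.
    scores = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=axis, keepdims=True)


def sparse_self_attention(x, keep, causal=False):
    """Single-head self-attention over the second-to-last axis of x (..., n, d).

    Assumptions (illustrative only): identity Q/K/V projections, and a fixed
    top-`keep` rule per query row standing in for dynamic sparsification.
    Ties at the threshold may retain more than `keep` entries.
    """
    n, d = x.shape[-2], x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., n, n)
    if causal:
        # Forbid attention to later positions (strictly upper triangle).
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    keep = min(keep, n)
    # Per-row threshold: the keep-th largest score in that row.
    kth = np.sort(scores, axis=-1)[..., n - keep][..., None]
    scores = np.where(scores >= kth, scores, -np.inf)
    return softmax(scores) @ x


def two_way_block(x, keep_time, keep_series):
    """One hypothetical two-way block on x of shape (T, N, d):
    T time steps, N series, d features per series."""
    # Temporal pass: each series attends over its own past, shape (N, T, d).
    xt = np.swapaxes(x, 0, 1)
    xt = xt + sparse_self_attention(xt, keep_time, causal=True)
    x = np.swapaxes(xt, 0, 1)
    # Cross-sectional pass: at each time step, series attend to each other.
    return x + sparse_self_attention(x, keep_series)


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4, 3))  # 6 days, 4 series, 3 features
out = two_way_block(x, keep_time=2, keep_series=2)
print(out.shape)  # (6, 4, 3)
```

In this sketch, the temporal pass is causal and the cross-sectional pass only mixes series within the same time step, so the output at a given time depends only on inputs up to that time. Shrinking `keep_time` or `keep_series` zeroes out more attention weights, which is loosely analogous to the sparsity-inducing regularization the abstract compares against.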

Paper Structure (17 sections, 19 equations, 13 tables)