Benchmarking M-LTSF: Frequency and Noise-Based Evaluation of Multivariate Long Time Series Forecasting Models
Nick Janßen, Melanie Schaller, Bodo Rosenhahn
TL;DR
The paper addresses the difficulty of assessing M-LTSF model robustness when noise properties are unknown by introducing a parameterizable synthetic benchmark that controls signal components, noise types, SNR, and frequency content. It evaluates four architectures (S-Mamba, iTransformer, Autoformer, and R-Linear) across a factorial design, revealing that insufficient lookback windows impair all models and that each architecture exhibits distinct preferences for certain signal types and noise conditions. The work provides detailed insights into frequency reconstruction and noise robustness, along with model-selection guidance: S-Mamba and iTransformer excel in spectral learning, Autoformer performs well on high-frequency, sawtooth-like signals, and R-Linear serves as a fast, simple baseline. The framework and findings offer practical benchmarks for model selection and directions for future research, including richer noise models and validation on real-world data with well-characterized noise.
Abstract
Understanding the robustness of deep learning models for multivariate long-term time series forecasting (M-LTSF) remains challenging, as evaluations typically rely on real-world datasets with unknown noise properties. We propose a simulation-based evaluation framework that generates parameterizable synthetic datasets, where each dataset instance corresponds to a different configuration of signal components, noise types, signal-to-noise ratios, and frequency characteristics. These configurable components aim to model real-world multivariate time series data without the ambiguity of unknown noise. The framework enables fine-grained, systematic evaluation of M-LTSF models under controlled and diverse scenarios. We benchmark four representative architectures: S-Mamba (state-space), iTransformer (transformer-based), R-Linear (linear), and Autoformer (decomposition-based). Our analysis reveals that all models degrade severely when lookback windows cannot capture complete periods of seasonal patterns in the data. S-Mamba and Autoformer perform best on sawtooth patterns, while R-Linear and iTransformer favor sinusoidal signals. White and Brownian noise degrade the performance of all models as the signal-to-noise ratio decreases, while S-Mamba is specifically vulnerable to trend noise and iTransformer to seasonal noise. Further spectral analysis shows that S-Mamba and iTransformer achieve superior frequency reconstruction. This controlled approach, based on our synthetic and principle-driven testbed, offers deeper insights into model-specific strengths and limitations through the aggregation of MSE scores and provides concrete guidance for model selection based on signal characteristics and noise conditions.
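To make the idea of a parameterizable dataset instance concrete, the following is a minimal sketch of how one univariate channel with a chosen signal shape, noise type, and signal-to-noise ratio might be generated. It is not the authors' implementation: the function names (`make_signal`, `make_noise`, `add_noise_at_snr`, `measured_snr_db`), the use of mean-square power, and the SNR convention in decibels are all assumptions made for illustration.

```python
import numpy as np


def make_signal(n_steps, period, kind="sine"):
    """Return a clean periodic signal (hypothetical helper, not from the paper)."""
    t = np.arange(n_steps)
    if kind == "sine":
        return np.sin(2 * np.pi * t / period)
    if kind == "sawtooth":
        # Ramp from -1 to 1 once per period, built without SciPy.
        phase = t / period
        return 2.0 * (phase - np.floor(phase + 0.5))
    raise ValueError(f"unknown signal kind: {kind}")


def make_noise(n_steps, kind, rng):
    """Return unscaled noise; Brownian noise is the cumulative sum of white noise."""
    white = rng.standard_normal(n_steps)
    if kind == "white":
        return white
    if kind == "brownian":
        return np.cumsum(white)
    raise ValueError(f"unknown noise kind: {kind}")


def add_noise_at_snr(signal, noise, snr_db):
    """Scale the noise so the mean-square power ratio matches snr_db (assumed convention)."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return signal + scale * noise


def measured_snr_db(clean, noisy):
    """Recompute the SNR in dB from the clean signal and the noisy observation."""
    residual = noisy - clean
    return 10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Example configuration: sawtooth with period 64, Brownian noise, 10 dB SNR.
rng = np.random.default_rng(0)
clean = make_signal(512, period=64, kind="sawtooth")
noisy = add_noise_at_snr(clean, make_noise(512, "brownian", rng), snr_db=10.0)
```

Because the clean signal is kept alongside the noisy one, the achieved SNR can be verified with `measured_snr_db(clean, noisy)`, which is exactly the property that real-world datasets with unknown noise lack. A multivariate instance could be assembled by stacking several such channels with different periods and noise settings.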
