Benchmarking M-LTSF: Frequency and Noise-Based Evaluation of Multivariate Long Time Series Forecasting Models

Nick Janßen, Melanie Schaller, Bodo Rosenhahn

TL;DR

The paper addresses the fragility of M-LTSF models under unknown noise by introducing a parameterizable synthetic benchmark that controls signal components, noise types, SNR, and frequency content. It evaluates four architectures—S-Mamba, iTransformer, Autoformer, and R-Linear—across a factorial design, revealing that insufficient lookback windows impair all models and that each architecture exhibits distinct preferences for certain signal types and noise conditions. The work provides detailed insights into frequency reconstruction, noise robustness, and model-selection guidance, showing that S-Mamba and iTransformer excel in spectral learning while Autoformer performs well on high-frequency, sawtooth-like signals, with R-Linear serving as a fast, simple baseline. The framework and findings offer practical benchmarks for model selection and direction for future research, including richer noise models and validation on real-world data with well-characterized noise.

Abstract

Understanding the robustness of deep learning models for multivariate long-term time series forecasting (M-LTSF) remains challenging, as evaluations typically rely on real-world datasets with unknown noise properties. We propose a simulation-based evaluation framework that generates parameterizable synthetic datasets, where each dataset instance corresponds to a different configuration of signal components, noise types, signal-to-noise ratios, and frequency characteristics. These configurable components aim to model real-world multivariate time series data without the ambiguity of unknown noise. This framework enables fine-grained, systematic evaluation of M-LTSF models under controlled and diverse scenarios. We benchmark four representative architectures: S-Mamba (state-space), iTransformer (transformer-based), R-Linear (linear), and Autoformer (decomposition-based). Our analysis reveals that all models degrade severely when lookback windows cannot capture complete periods of seasonal patterns in the data. S-Mamba and Autoformer perform best on sawtooth patterns, while R-Linear and iTransformer favor sinusoidal signals. White and Brownian noise degrade the performance of all models as the signal-to-noise ratio decreases, while S-Mamba is specifically vulnerable to trend noise and iTransformer to seasonal noise. Further spectral analysis shows that S-Mamba and iTransformer achieve superior frequency reconstruction. This controlled approach, based on our synthetic and principle-driven testbed, offers deeper insights into model-specific strengths and limitations through the aggregation of MSE scores and provides concrete guidance for model selection based on signal characteristics and noise conditions.

Paper Structure

This paper contains 24 sections, 15 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Radar plots to guide model selection. The left figure shows the inverse of the best MSE value of each model across seasonality types (sine, square, sawtooth) and noise types (white noise, Brownian noise, impulse noise, trend noise, and seasonal noise) at an SNR value of 100. S-Mamba shows the best overall performance across all dataset characteristics. The right figure shows the efficiency of each model. The features are inference speed (iterations/s), training speed (iterations/s), number of parameters (inverse), and the best MSE score (inverse) across all evaluated experiments. R-Linear offers fast inference but weaker forecasting performance, while iTransformer and S-Mamba show good performance but slower inference.
  • Figure 2: Example synthetic time series components. Upper panel: trend components with varying exponent values $b$. Lower panel: seasonality components showing different waveform types (Sine, Saw, Square) with varying frequency and phase parameters.
  • Figure 3: Example time series demonstrating different noise types: White Noise and Brownian Noise, the latter being the cumulative sum of White Noise (top left), Impulse Noise (top right), Trend Noise (bottom left), and Seasonal Noise (bottom right). All noise types except white and Brownian noise can exhibit inverted or anti-proportional characteristics.
  • Figure 4: Overview of time series generation for a univariate signal. A time series is constructed for each component individually. Each component is weighted by a randomly sampled factor ($w_{signal,i}$ and $w_{noise,i}$) and aggregated into separate signal and noise time series. The signal and noise are then scaled according to the specified signal-to-noise ratio and combined to produce the final time series shown on the right. All displayed time series are z-normalized, and the shown weights are randomly sampled examples.
  • Figure 5: Overview of the frequency index ranges used in the evaluations for analyzing frequency-dependent model behaviors. The defined ranges are: very low (1–500), low (500–1000), low-mid (1000–1500), mid (6000–6500), mid-high (8000–8500), high (12000–12500), and very high (16000–16500).
  • ...and 11 more figures
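
The generation procedure described in the Figure 4 caption (individual components, random weights, separate signal and noise aggregates, scaling to a target SNR, z-normalization) can be sketched as follows. This is a minimal illustration, not the authors' implementation. Several details are assumptions that the captions do not specify: SNR is treated as a linear ratio of signal variance to noise variance, the weights are drawn uniformly from [0.1, 1.0], the trend has the form t^b with b = 2, and only one sine seasonality plus white and Brownian noise are used. The function names `znorm` and `make_series` are hypothetical.

```python
import numpy as np


def znorm(x):
    """Z-normalize a 1-D array to zero mean and unit variance."""
    return (x - x.mean()) / x.std()


def make_series(n=2000, snr=100.0, seed=0):
    """Build one univariate series from weighted signal and noise components.

    Returns the z-normalized series plus the (centered) signal and the
    scaled noise, so that var(signal) / var(noise) equals `snr`.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n) / n  # normalized time axis in [0, 1)

    # Signal components (illustrative): a power-law trend and a sine seasonality.
    signal_parts = [t ** 2.0, np.sin(2 * np.pi * 10 * t)]

    # Noise components: white noise, and Brownian noise as the cumulative
    # sum of a second white-noise draw.
    noise_parts = [rng.standard_normal(n), np.cumsum(rng.standard_normal(n))]

    # Randomly sampled component weights (assumed uniform distribution).
    w_signal = rng.uniform(0.1, 1.0, len(signal_parts))
    w_noise = rng.uniform(0.1, 1.0, len(noise_parts))

    # Aggregate the weighted components into separate signal and noise series.
    signal = sum(w * znorm(p) for w, p in zip(w_signal, signal_parts))
    noise = sum(w * znorm(p) for w, p in zip(w_noise, noise_parts))
    signal = signal - signal.mean()
    noise = noise - noise.mean()

    # Rescale the noise so that the variance ratio matches the target SNR.
    noise = noise * np.sqrt(signal.var() / (snr * noise.var()))

    return znorm(signal + noise), signal, noise


series, signal, noise = make_series(snr=100.0)
print(signal.var() / noise.var())  # variance ratio achieved by the scaling step
```

Scaling only the noise keeps the signal amplitude fixed across SNR settings, which is one of several equivalent choices because the final series is z-normalized anyway.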