AltTS: A Dual-Path Framework with Alternating Optimization for Multivariate Time Series Forecasting

Zhihang Yuan, Zhiyuan Liu, Mahesh K. Marina

TL;DR

AltTS addresses gradient entanglement in multivariate time series forecasting by decoupling autoregressive (AR) dynamics from cross-relation (CR) interactions between variables in a dual-path framework. The AR path is a linear per-series predictor with RevIN, while the CR path is a Transformer with Cross-Relation Self-Attention (CRSA) that explicitly models cross-variable dependencies and masks the attention diagonal to prevent AR leakage. The two paths are coordinated through alternating optimization (AO): AR and CR parameters are updated in turn, each with its own optimizer, to reduce gradient noise and interference. Empirically, AltTS achieves results competitive with or better than the state of the art across seven long-term time series forecasting (LTSF) benchmarks, with the largest gains at long horizons, and ablations confirm that AR/CR decoupling and AO are critical to stabilizing training and improving accuracy. The work treats the training schedule as a design variable and suggests that optimization-driven design choices can drive progress as effectively as more complex models.
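The diagonal masking mentioned above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single attention head, one embedding vector per variable, and hypothetical projection matrices `w_q`, `w_k`, `w_v`. It only shows how setting the diagonal of the score matrix to negative infinity removes each variable's link to itself.

```python
import numpy as np

def masked_cross_relation_attention(x, w_q, w_k, w_v):
    """Single-head attention over variables with self-links masked out.

    x: (n_vars, d_model) array, one embedding per variable (assumed layout).
    w_q, w_k, w_v: (d_model, d_head) projections (hypothetical names).
    Returns (output, attention_weights).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])      # (n_vars, n_vars)
    # Diagonal mask: a variable may not attend to itself, so within-series
    # (AR) information cannot leak into the cross-relation path.
    np.fill_diagonal(scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_vars, d_model, d_head = 4, 8, 8
x = rng.normal(size=(n_vars, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = masked_cross_relation_attention(x, w_q, w_k, w_v)
```

Each row of `attn` still sums to one, but its diagonal entries are exactly zero, so every output row is a mixture of the other variables only. The sketch needs at least two variables; with a single variable the masked row would have no valid entries.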

Abstract

Multivariate time series forecasting involves two qualitatively distinct factors: (i) stable within-series autoregressive (AR) dynamics, and (ii) intermittent cross-dimension interactions that can become spurious over long horizons. We argue that fitting a single model to capture both effects creates an optimization conflict: the high-variance updates needed for cross-dimension modeling can corrupt the gradients that support autoregression, resulting in brittle training and degraded long-horizon accuracy. To address this, we propose AltTS, a dual-path framework that explicitly decouples autoregression and cross-relation (CR) modeling. In AltTS, the AR path is instantiated with a linear predictor, while the CR path uses a Transformer equipped with Cross-Relation Self-Attention (CRSA); the two branches are coordinated via alternating optimization to isolate gradient noise and reduce cross-block interference. Extensive experiments on multiple benchmarks show that AltTS consistently outperforms prior methods, with the most pronounced improvements on long-horizon forecasting. Overall, our results suggest that carefully designed optimization strategies, rather than ever more complex architectures, can be a key driver of progress in multivariate time series forecasting.
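As a rough illustration of alternating optimization between two parameter blocks, the following NumPy sketch trains a toy dual-path model. Everything in it is an assumption made for illustration: the CR path is replaced by a simple off-diagonal mixing matrix `A` rather than a Transformer, the data are synthetic, the full batch is used instead of minibatches, and the per-block "optimizer" is plain SGD with momentum. It is meant only to show the update pattern, in which one block is updated while the other is held fixed and each block keeps its own optimizer state.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, L, H = 256, 3, 8, 2            # windows, variables, lookback, horizon

# Synthetic data: a shared per-series linear map plus off-diagonal coupling.
X = rng.normal(size=(N, C, L))
W_true = rng.normal(size=(L, H)) * 0.3
A_true = np.array([[0.0, 0.5, 0.0], [0.0, 0.0, -0.4], [0.3, 0.0, 0.0]])
Z = X[:, :, -H:]                     # features seen by the toy CR path
Y = X @ W_true + A_true @ Z + 0.01 * rng.normal(size=(N, C, H))

off_diag = 1.0 - np.eye(C)           # removes self-links from the CR path

def predict(W, A):
    # Final prediction is the sum of the AR and CR path outputs.
    return X @ W + (A * off_diag) @ Z

def loss(W, A):
    return np.mean((predict(W, A) - Y) ** 2)

def grads(W, A):
    R = 2.0 * (predict(W, A) - Y) / Y.size             # dLoss/dPrediction
    g_W = np.einsum("ncl,nch->lh", X, R)               # AR-block gradient
    g_A = np.einsum("nih,njh->ij", R, Z) * off_diag    # CR-block gradient
    return g_W, g_A

def momentum_step(param, grad, state, beta=0.9):
    state["m"] = beta * state["m"] + grad
    return param - state["lr"] * state["m"]

W, A = np.zeros((L, H)), np.zeros((C, C))
# Independent optimizer state per block: own learning rate and momentum.
opt = {"AR": {"lr": 0.1, "m": np.zeros_like(W)},
       "CR": {"lr": 0.1, "m": np.zeros_like(A)}}

initial_loss = loss(W, A)
for step in range(400):
    # Both gradients are computed here for brevity; only one is applied.
    g_W, g_A = grads(W, A)
    if step % 2 == 0:                # AR turn: CR parameters stay fixed
        W = momentum_step(W, g_W, opt["AR"])
    else:                            # CR turn: AR parameters stay fixed
        A = momentum_step(A, g_A, opt["CR"])
final_loss = loss(W, A)
```

The strict one-step alternation used here is only one possible schedule; the text above does not specify how many updates each block receives per turn.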

Paper Structure

This paper contains 32 sections, 1 theorem, 14 equations, 6 figures, and 3 tables.

Key Result

Proposition 3.2

Suppose the loss function $\mathcal{L}(\theta_{\text{AR}}, \theta_{\text{CR}};B)=\ell(\theta_{\text{AR}}, \theta_{\text{CR}};B) + R_{\text{AR}}(\theta_{\text{AR}}) + R_{\text{CR}}(\theta_{\text{CR}})$ is the same under alternating and joint training, where $B$ denotes a random minibatch. Let $\mathr

Figures (6)

  • Figure 1: Variance of AR/CR gradients under alternating vs. joint training on seven datasets (prediction length $=96$). The y-axis is the natural log of the gradient variance statistic. We compute the rolling sample variance of gradients for each parameter over the last $K$ updates; within each branch, we take the parameter-wise sum to yield a scalar for AR and CR in the rolling window, respectively. Extended plots for horizons $192/336/720$ are in the appendix.
  • Figure 2: Architecture of AltTS. (a) Multivariate time series is passed into two parallel paths, the channel-independent AR path and the channel-dependent CR path. Outputs are summed to obtain the final prediction. (b) The cross-relation self-attention forms queries/keys/values from per-variable embeddings and an AR mask is applied to the attention matrix to suppress intra-series links.
  • Figure 3: Prediction length $=96$. Variance of AR/CR gradients under joint training across seven datasets. Higher variance indicates greater training instability, motivating alternating optimization in AltTS.
  • Figure 4: Prediction length $=192$. Variance of AR/CR gradients under joint training across seven datasets.
  • Figure 5: Prediction length $=336$. Variance of AR/CR gradients under joint training across seven datasets.
  • ...and 1 more figure
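The gradient-variance statistic described in the Figure 1 caption can be sketched as follows. This is a minimal reading of the caption, not the authors' code: the class name `RollingGradVariance` is hypothetical, the unbiased sample variance (`ddof=1`) is an assumption, and one tracker would be kept per branch (AR and CR), each fed that branch's flattened gradient after every update.

```python
from collections import deque

import numpy as np

class RollingGradVariance:
    """Log of the summed per-parameter gradient variance over the last K updates."""

    def __init__(self, window):
        # Keeps only the most recent `window` gradients (K in the caption).
        self.buffer = deque(maxlen=window)

    def update(self, grad):
        self.buffer.append(np.asarray(grad, dtype=float).ravel())
        if len(self.buffer) < 2:
            return None                      # sample variance needs two gradients
        stacked = np.stack(self.buffer)      # (k, n_params)
        per_param_var = stacked.var(axis=0, ddof=1)
        # Parameter-wise sum gives one scalar per branch; plot its natural log.
        return float(np.log(per_param_var.sum()))

tracker = RollingGradVariance(window=2)
first = tracker.update([1.0, 0.0])
second = tracker.update([3.0, 0.0])
third = tracker.update([5.0, 0.0])           # the oldest gradient is dropped
```

With a window of two, `second` and `third` both equal `log(2)`, because each window holds two gradients whose first coordinate differs by 2 and whose second coordinate is constant. If every gradient in the window is identical, the summed variance is zero and the logarithm is negative infinity, so a small floor may be needed in practice.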

Theorems & Definitions (1)

  • Proposition 3.2