Adaptive Sparsity Level during Training for Efficient Time Series Forecasting with Transformers

Zahra Atashgahi; Mykola Pechenizkiy; Raymond Veldhuis; Decebal Constantin Mocanu

TL;DR

PALS addresses the challenge of automatically balancing loss and sparsity during training for time series forecasting with Transformers. It adds an "expand" mechanism alongside shrink and stable updates to adjust connectivity dynamically, eliminating the need for a predefined sparsity target. Across six datasets and five transformer variants, PALS reduces parameter count by 65% and FLOPs by 63% on average while preserving or even improving forecasting accuracy: it outperforms the dense baseline in 12 of 30 cases for MSE and 14 of 30 for MAE. The method builds on sparse training and during-training pruning and is applicable beyond Transformers, which matters for deploying efficient, scalable time series models. Hardware support for sparsity remains a bottleneck, but PALS demonstrates a principled, automated route to compact forecasting models.

Abstract

Efficient time series forecasting has become critical for real-world applications, particularly with deep neural networks (DNNs). Efficiency in DNNs can be achieved through sparse connectivity and reduced model size. However, finding the sparsity level automatically during training remains challenging because the loss-sparsity tradeoff varies across datasets. In this paper, we propose "Pruning with Adaptive Sparsity Level" (PALS), which automatically seeks a good balance between loss and sparsity without the need for a predefined sparsity level. PALS draws inspiration from sparse training and during-training pruning methods. It introduces a novel "expand" mechanism for training sparse neural networks, allowing the model to dynamically shrink, expand, or remain stable to find a proper sparsity level. We focus on achieving efficiency in transformers, which are known for their excellent time series forecasting performance but high computational cost. Nevertheless, PALS can be applied directly to any DNN; to demonstrate this, we also evaluate it on the DLinear model. Experimental results on six benchmark datasets and five state-of-the-art (SOTA) transformer variants show that PALS substantially reduces model size while maintaining performance comparable to the dense model. More interestingly, PALS even outperforms the dense model in 12 and 14 out of 30 cases in terms of MSE and MAE loss, respectively, while reducing the parameter count by 65% and FLOPs by 63% on average. Our code and supplementary material are available on GitHub: https://github.com/zahraatashgahi/PALS
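To make the shrink / expand / stable idea concrete, the following is a minimal, hypothetical sketch and not the authors' algorithm. The loss-ratio rule, the thresholds `shrink_tol` and `expand_tol`, the update `fraction`, and all function names are assumptions introduced here for illustration; the abstract does not specify how PALS chooses between the three actions.

```python
import random


def choose_action(loss, prev_loss, shrink_tol=0.05, expand_tol=0.20):
    """Pick a connectivity update from the change in loss.

    Hypothetical rule (an assumption, not taken from the paper):
    shrink while the loss has barely degraded, expand once it has
    clearly degraded, and otherwise keep the density unchanged.
    """
    ratio = loss / prev_loss
    if ratio <= 1.0 + shrink_tol:
        return "shrink"
    if ratio >= 1.0 + expand_tol:
        return "expand"
    return "stable"


def update_mask(weights, mask, action, fraction=0.1, rng=random):
    """Return a new 0/1 mask after applying the chosen action.

    shrink: drop the smallest-magnitude active weights.
    expand: re-activate randomly chosen inactive positions.
    stable: leave the mask untouched (a simplification of this sketch).
    """
    mask = list(mask)
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    if action == "shrink":
        k = int(fraction * len(active))
        for i in sorted(active, key=lambda i: abs(weights[i]))[:k]:
            mask[i] = 0
    elif action == "expand":
        k = min(int(fraction * len(active)), len(inactive))
        for i in rng.sample(inactive, k):
            mask[i] = 1
    return mask


def sparsity(mask):
    """Fraction of positions that are masked out."""
    return 1.0 - sum(mask) / len(mask)


if __name__ == "__main__":
    weights = [0.1 * i for i in range(1, 21)]
    mask = [1] * len(weights)
    # Made-up validation losses, one per connectivity update.
    losses = [1.00, 0.98, 0.97, 1.25, 1.30]
    for prev, cur in zip(losses, losses[1:]):
        action = choose_action(cur, prev)
        mask = update_mask(weights, mask, action)
        print(action, round(sparsity(mask), 2))
```

In this sketch no target sparsity is ever supplied: the density drifts to wherever the loss signal pushes it, which is the property the abstract attributes to PALS. A faithful implementation would follow the algorithm in the paper and the released code.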

Paper Structure (42 sections, 5 equations, 11 figures, 11 tables, 1 algorithm)

Introduction
Background
Sparse Neural Networks
Dense-to-sparse
Sparse-to-sparse
Sparsity in Transformers.
Time Series Forecasting
Problem Formulation and Notations
Analyzing Sparsity Effect in Transformers for Time Series Forecasting
Experimental Settings.
Sparsity Effect.
Challenge.
Proposed Methodology: PALS
Motivation and Broad Outline.
Experiments and Results
...and 27 more sections

Figures (11)

Figure 1: Schematic overview of the proposed method, PALS (Algorithm 1), alongside Dynamic Sparse Training (DST) [mocanu2018scalable, evci2020rigging] and during-training pruning (Gradual Magnitude Pruning (GMP) [zhu2017prune] and GraNet [liu2021sparse]). While DST and during-training pruning use a fixed sparsity schedule to reach a pre-determined sparsity level at the end of training, PALS updates the sparse connectivity of the network every $\Delta t$ iterations during training by deciding whether to "Shrink" (decrease density), "Expand" (increase density), or remain "Stable" (same density), and thereby automatically finds a proper sparsity level.
Figure 2: Sparsity effect on the performance of various transformer models for time series forecasting on benchmark datasets in terms of MSE loss (prediction length $=96$, except $24$ for the Illness dataset). Each model is sparsified using GraNet [liu2021sparse] to sparsity levels ($\%$) $\in \{25, 50, 65, 80, 90, 95\}$ and using PALS. Sparsity $= 0$ indicates the original dense model.
Figure 3: Sparsity effect on the performance of various transformer models for time series forecasting on benchmark datasets in terms of MSE loss for various prediction lengths, as indicated in each figure. Each model is sparsified using GraNet [liu2021sparse] to sparsity levels ($\%$) $\in \{25, 50, 65, 80, 90, 95\}$. Sparsity $= 0$ indicates the original dense model.
Figure 4: Sparsity level of each network during training of PALS. In most cases, the final sparsity is achieved within a few epochs after the training starts. Therefore, the forward pass during training is performed sparsely for a large fraction of the training process.
Figure 5: Effect of model size (varying $d_{model} \in \{256, 512\ \text{(default)}, 768\}$) on the prediction performance of PALS compared to the original dense model.
...and 6 more figures
