A Comparative Study of Pruning Methods in Transformer-based Time Series Forecasting

Nicholas Kiefer, Arvid Weyrauch, Muhammed Öz, Achim Streit, Markus Götz, Charlotte Debus

TL;DR

This paper tackles the scalability challenge of Transformer-based time-series forecasting by benchmarking unstructured and structured pruning across five models and multiple datasets. The authors quantify predictive performance, parameter counts, FLOPs, and inference time, and they study fine-tuning and dataset-size effects to rule out overfitting. They find that unstructured pruning preserves accuracy up to a sparsity of roughly $s \approx 0.5$, with the Fourier-based models (Autoformer, FEDformer) tolerating much higher sparsity, near $s \approx 0.9$, while structured pruning yields notable FLOP reductions but limited real-world speedups due to architectural interdependencies. The results are strongly dataset-dependent: small models can outperform large ones on small datasets, whereas larger datasets still benefit from high-capacity models. These findings guide practical deployment and motivate hardware-aware compression strategies for time-series Transformers.

Abstract

The current landscape in time-series forecasting is dominated by Transformer-based models. Their high parameter count and corresponding demand for computational resources pose a challenge to real-world deployment, especially for commercial and scientific applications with low-power embedded devices. Pruning is an established approach to reduce neural network parameter count and save compute. However, the implications and benefits of pruning Transformer-based models for time-series forecasting are largely unknown. To close this gap, we provide a comparative benchmark study by evaluating unstructured and structured pruning on various state-of-the-art multivariate time-series models. We study the effects of these pruning strategies on model predictive performance and on computational aspects such as model size, operations, and inference time. Our results show that certain models can be pruned even up to high sparsity levels while outperforming their dense counterparts. However, fine-tuning pruned models is necessary. Furthermore, we demonstrate that even with corresponding hardware and software support, structured pruning is unable to provide significant time savings.
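The difference between the two pruning families can be illustrated on a single toy weight matrix. The sketch below is not the authors' code; it is a minimal, dependency-free illustration under stated assumptions: the function names `magnitude_prune`, `node_prune`, and `density`, the 4x4 matrix `W`, and the L2-norm row criterion are all hypothetical choices made for this example. It shows that unstructured magnitude pruning zeroes individual weights but leaves the matrix shape unchanged, whereas structured node pruning removes whole rows and therefore shrinks the dense computation.

```python
import math


def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the `sparsity` fraction of entries
    with the smallest absolute value. The matrix shape is unchanged."""
    positions = [(r, c) for r in range(len(weights)) for c in range(len(weights[0]))]
    # Rank every individual weight by magnitude, smallest first.
    positions.sort(key=lambda rc: abs(weights[rc[0]][rc[1]]))
    k = int(sparsity * len(positions))  # number of weights to zero
    pruned = [row[:] for row in weights]
    for r, c in positions[:k]:
        pruned[r][c] = 0.0
    return pruned


def node_prune(weights, sparsity):
    """Structured pruning (illustrative criterion): drop the `sparsity`
    fraction of rows, i.e. output nodes, with the smallest L2 norm."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda r: norms[r])
    dropped = set(ranked[: int(sparsity * len(weights))])
    return [row[:] for r, row in enumerate(weights) if r not in dropped]


def density(weights, original_count):
    """Fraction of the original parameters that are still nonzero."""
    nonzero = sum(1 for row in weights for w in row if w != 0.0)
    return nonzero / original_count


# Hypothetical 4x4 layer weights, chosen only for illustration.
W = [
    [0.9, -0.05, 0.3, 0.01],
    [0.02, -0.6, 0.04, 0.7],
    [0.03, 0.06, -0.07, 0.08],
    [-0.8, 0.2, 0.5, -0.1],
]

unstructured = magnitude_prune(W, 0.5)  # still 4x4, half the entries are zero
structured = node_prune(W, 0.5)         # only 2 of the 4 rows remain

print(len(unstructured), len(unstructured[0]), density(unstructured, 16))
print(len(structured), len(structured[0]), density(structured, 16))
```

Both results have the same parameter density, but only the structured result is a smaller dense matrix. The zeros produced by unstructured pruning reduce compute only when sparse kernels or hardware exploit them, which is one reason the two methods are evaluated separately on operations and inference time.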

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Pruning results for weight magnitude pruning. Plotted is the MSE on the test dataset over the parameter density of the models for all datasets and forecast lengths, with logarithmic scaling on the x-axis. Models are Transformer, Informer, Autoformer, FEDformer, and Crossformer. Best viewed zoomed in.
  • Figure 2: Pruning results for structured node pruning using torch-pruning. Plotted is the MSE on the test dataset over the measured parameter density for all forecast lengths and datasets. Models are Transformer, Informer, Autoformer, FEDformer, and Crossformer. Best viewed zoomed in.
  • Figure 3: Pruning results for weight magnitude pruning on the ENTSO-E test dataset with prediction length 192. Plotted is the MSE of all large models over their parameter density. Models are Transformer, Informer, Autoformer, FEDformer, and Crossformer.
  • Figure 4: Loss curves for all models during training on the ENTSO-E dataset, averaged over 50 steps to reduce noise, and example predictions on the test dataset after training.