A Comparative Study of Pruning Methods in Transformer-based Time Series Forecasting
Nicholas Kiefer, Arvid Weyrauch, Muhammed Öz, Achim Streit, Markus Götz, Charlotte Debus
TL;DR
This paper addresses the scalability challenge of Transformer-based time series forecasting by benchmarking unstructured and structured pruning across five models and multiple datasets. The authors measure predictive performance, parameter counts, FLOPs, and inference time, and they study fine-tuning and dataset-size effects to rule out overfitting. They find that unstructured pruning preserves accuracy up to a sparsity (the fraction of pruned weights) of roughly $s \approx 0.5$, with the Fourier-based models (Autoformer, FEDformer) tolerating much higher sparsity, near $s \approx 0.9$. Structured pruning yields notable FLOP reductions but only limited real-world speedups because of architectural interdependencies. The results are strongly dataset-dependent: small models can outperform large ones on small datasets, while larger datasets still benefit from high-capacity models. These findings guide practical deployment and motivate hardware-aware compression strategies for time series Transformers.
Abstract
The current landscape in time series forecasting is dominated by Transformer-based models. Their high parameter count and corresponding demand for computational resources pose a challenge to real-world deployment, especially for commercial and scientific applications on low-power embedded devices. Pruning is an established approach to reduce the parameter count of neural networks and save compute. However, the implications and benefits of pruning Transformer-based models for time series forecasting are largely unknown. To close this gap, we provide a comparative benchmark study that evaluates unstructured and structured pruning on various state-of-the-art multivariate time series models. We study the effects of these pruning strategies on predictive performance and on computational aspects such as model size, number of operations, and inference time. Our results show that certain models can be pruned to high sparsity levels while still outperforming their dense counterparts. However, fine-tuning pruned models is necessary. Furthermore, we demonstrate that even with corresponding hardware and software support, structured pruning is unable to provide significant time savings.
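To make the distinction between the two pruning families concrete, the following is a minimal sketch on a toy weight matrix, not the authors' implementation. It assumes a magnitude criterion (smallest absolute values for unstructured pruning, smallest row L1 norms for structured pruning), which is a common choice but is not stated in the text above. The function names `unstructured_prune`, `structured_prune_rows`, and `sparsity_of` are hypothetical and exist only for this illustration.

```python
# Illustrative sketch only: toy weights as nested lists, no deep-learning framework.
# The magnitude-based criteria are an assumption, not taken from the paper.

def unstructured_prune(weights, sparsity):
    """Zero the `sparsity` fraction of individual entries with the smallest magnitude."""
    indexed = sorted(
        (abs(w), i, j) for i, row in enumerate(weights) for j, w in enumerate(row)
    )
    k = int(sparsity * len(indexed))  # number of entries to zero
    pruned = [row[:] for row in weights]  # copy, keep the original intact
    for _, i, j in indexed[:k]:
        pruned[i][j] = 0.0
    return pruned


def structured_prune_rows(weights, sparsity):
    """Remove the `sparsity` fraction of whole rows with the smallest L1 norm."""
    norms = sorted((sum(abs(w) for w in row), i) for i, row in enumerate(weights))
    k = int(sparsity * len(weights))  # number of rows to drop
    drop = {i for _, i in norms[:k]}
    return [row[:] for i, row in enumerate(weights) if i not in drop]


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


# Toy 4x4 weight matrix (values chosen arbitrarily for the example).
W = [
    [0.90, -0.10, 0.40, 0.05],
    [0.02, 0.03, -0.01, 0.04],
    [-0.70, 0.60, 0.20, -0.30],
    [0.15, -0.80, 0.50, 0.25],
]

U = unstructured_prune(W, 0.5)     # same 4x4 shape, half the entries zeroed
S = structured_prune_rows(W, 0.5)  # smaller 2x4 matrix, remaining rows stay dense
```

In this sketch the unstructured result keeps its original shape, so a dense matrix multiplication costs the same unless sparse kernels are available. The structured result is a genuinely smaller matrix, but in a real network every layer that consumes the removed rows would have to be resized as well.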
