Less is More: Unlocking Specialization of Time Series Foundation Models via Structured Pruning

Lifan Zhao, Yanyan Shen, Zhaoyang Liu, Xue Wang, Jiaji Deng

TL;DR

The study exposes inherent sparsity and task-specific substructures in Time Series Foundation Models and proposes a prune-then-finetune pipeline to preserve architectural priors while specializing TSFMs for downstream forecasting. By defining pruning units as input/output channels and employing loss-guided importance scores, the method progressively removes redundant components before fine-tuning, yielding improved accuracy and up to 7x faster inference. Across seven TSFMs and six benchmarks, the approach often outperforms full fine-tuning and strong specialized baselines, with notable zero-shot transfer benefits within related domains. This work highlights architectural specialization as a practical route to unlock TSFMs' potential in real-world forecasting tasks.

Abstract

Scaling laws motivate the development of Time Series Foundation Models (TSFMs) that pre-train vast parameters and achieve remarkable zero-shot forecasting performance. Surprisingly, even after fine-tuning, TSFMs cannot consistently outperform smaller, specialized models trained on full-shot downstream data. A key question is how to realize effective adaptation of TSFMs for a target forecasting task. Through empirical studies on various TSFMs, we find that the pre-trained models often exhibit inherent sparsity and redundancy in computation, suggesting that TSFMs have learned to activate task-relevant network substructures to accommodate diverse forecasting tasks. To preserve this valuable prior knowledge, we propose a structured pruning method to regularize the subsequent fine-tuning process by focusing it on a more relevant and compact parameter space. Extensive experiments on seven TSFMs and six benchmarks demonstrate that fine-tuning a smaller, pruned TSFM significantly improves forecasting performance compared to fine-tuning original models. This prune-then-finetune paradigm often enables TSFMs to achieve state-of-the-art performance and surpass strong specialized baselines. Source code is made publicly available at https://github.com/SJTU-DMTai/Prune-then-Finetune.
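
To make the idea of loss-guided channel pruning concrete, the following is a minimal sketch and not the authors' implementation. It assumes a first-order Taylor score, |sum of weight times gradient| per output channel, as one common way to estimate how much the loss would change if a channel were removed; the exact score used in the paper is not stated here. The function names `channel_importance` and `keep_mask`, and the toy weight and gradient values, are hypothetical.

```python
import numpy as np

def channel_importance(weight, grad):
    """Score each output channel (row) by |sum_j w_ij * g_ij|.

    Assumption: a first-order Taylor estimate of the loss change caused by
    zeroing the channel. `grad` is dLoss/dWeight with the same shape as `weight`.
    """
    return np.abs((weight * grad).sum(axis=1))

def keep_mask(scores, prune_ratio):
    """Boolean mask that drops the lowest-scoring fraction of channels."""
    n_prune = int(len(scores) * prune_ratio)
    order = np.argsort(scores)  # ascending: least important channels first
    mask = np.ones(len(scores), dtype=bool)
    mask[order[:n_prune]] = False
    return mask

# Toy layer with 4 output channels and 2 input channels (made-up numbers).
weight = np.array([[1.0, -2.0], [0.1, 0.1], [3.0, 0.5], [0.0, 0.2]])
grad = np.array([[0.5, 0.5], [0.1, -0.1], [1.0, 1.0], [0.3, 0.1]])

scores = channel_importance(weight, grad)
mask = keep_mask(scores, prune_ratio=0.5)
pruned_weight = weight[mask]  # structured pruning: whole rows are removed
```

In a progressive scheme, a step like this would be repeated with a small `prune_ratio` over several rounds, with gradients recomputed on downstream data between rounds, before fine-tuning the remaining parameters.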

Paper Structure

This paper contains 40 sections, 5 equations, 10 figures, 10 tables, 1 algorithm.

Figures (10)

  • Figure 1: Performance comparison between TSFMs and PatchTST trained on full-shot training data of each benchmark. We calculate the average MSE of forecasting 96, 192, 336, and 720 steps, which is further divided by that of PatchTST with well-tuned hyperparameters [qiu2024tfb]. Weather is used for pre-training TimesFM and is not evaluated. The red dashed line represents the PatchTST baseline.
  • Figure 2: Cumulative distribution of the average relative output norm of one attention head over the Weather and ETTm1 datasets.
  • Figure 3: Cumulative distribution of the activation probability of one FFN intermediate channel over the Weather and ETTm1 datasets.
  • Figure 4: Proportion of sparsely activated channels in each FFN layer. A channel is identified as sparsely activated if its activation probability is less than 5% over the downstream dataset.
  • Figure 5: (Left) Boxplot of importance scores, where the largest box encompasses the middle 50% of the data distribution. (Right) Cumulative distribution of importance scores. The scores are calculated based on the Weather dataset and further divided by the maximum value. Distributions on the ETTm1 dataset are provided in Fig. \ref{fig:score_ETTm1}.
  • ...and 5 more figures
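
The sparsity statistic described in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes a channel counts as "activated" on a token when its post-activation value is positive (as with a ReLU-like nonlinearity), and the helper names `activation_probability` and `sparse_fraction` are hypothetical. Only the 5% threshold comes from the Figure 4 caption.

```python
import numpy as np

def activation_probability(acts):
    """Fraction of tokens on which each FFN intermediate channel is active.

    `acts` has shape (num_tokens, num_channels). Assumption: "active" means
    a strictly positive post-activation value.
    """
    return (acts > 0).mean(axis=0)

def sparse_fraction(acts, threshold=0.05):
    """Proportion of channels whose activation probability is below threshold."""
    return float((activation_probability(acts) < threshold).mean())

# Toy activations for 100 tokens and 4 channels (made-up values).
acts = np.zeros((100, 4))
acts[:, 0] = 1.0    # channel 0: active on every token
acts[:2, 1] = 1.0   # channel 1: active on 2 of 100 tokens
acts[:50, 3] = 1.0  # channel 3: active on half of the tokens
# channel 2 is never active

probs = activation_probability(acts)
frac = sparse_fraction(acts)
```

Here channels 1 and 2 fall below the 5% threshold, so half of the channels in this toy layer would be identified as sparsely activated.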

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3