Multi-Scale Finetuning for Encoder-based Time Series Foundation Models

Zhongzheng Qiao, Chenghao Liu, Yiming Zhang, Ming Jin, Quang Pham, Qingsong Wen, P. N. Suganthan, Xudong Jiang, Savitha Ramasamy

TL;DR

This work tackles the challenge of finetuning encoder-based Time Series Foundation Models (TSFMs) for downstream tasks. It introduces Multiscale Finetuning (MSFT), a causally informed framework that explicitly models multiple sampling scales through three components: scale-specific adapters, decoupled within-scale and cross-scale dependency modeling, and learned mixing of the per-scale predictions. By aligning finetuning with the interventional distribution $P(Y|do(X))$ through backdoor adjustment, MSFT mitigates the confounding effect of scale and leverages pretrained multi-scale knowledge to improve forecasting accuracy. Empirically, MSFT consistently outperforms naive finetuning and PEFT baselines on long-sequence and probabilistic forecasting tasks across multiple backbones (Moirai, Moment, UniTS), often surpassing state-of-the-art models trained from scratch while maintaining reasonable efficiency. The approach supports practical deployment of TSFMs by enabling finetuning that respects the inherent multi-scale nature of time series data.

Abstract

Time series foundation models (TSFMs) demonstrate impressive zero-shot performance for time series forecasting. However, an important yet underexplored challenge is how to effectively finetune TSFMs on specific downstream tasks. While naive finetuning can yield performance gains, we argue that it falls short of fully leveraging TSFMs' capabilities, often resulting in overfitting and suboptimal performance. Given the diverse temporal patterns across sampling scales and the inherent multi-scale forecasting capabilities of TSFMs, we adopt a causal perspective to analyze the finetuning process, through which we highlight the critical importance of explicitly modeling multiple scales and reveal the shortcomings of naive approaches. Focusing on encoder-based TSFMs, we propose Multiscale Finetuning (MSFT), a simple yet general framework that explicitly integrates multi-scale modeling into the finetuning process. Experimental results on three different backbones (Moirai, Moment and UniTS) demonstrate that TSFMs finetuned with MSFT not only outperform naive and typical parameter-efficient finetuning methods but also surpass state-of-the-art deep learning methods. Code is available at https://github.com/zqiao11/MSFT.

Paper Structure

This paper contains 68 sections, 10 equations, 9 figures, 15 tables, 2 algorithms.

Figures (9)

  • Figure 1: (a) Multi-scale property in time series foundation model (TSFM) finetuning. Finetuning TSFMs only on the original scale may overlook potential temporal patterns in time series and underutilize the multi-scale forecasting capabilities learned during pretraining. (b) Causal graph for TSFM forecasting. Nodes denote abstract data variables and directed edges denote causality, i.e., cause $\rightarrow$ effect. Scale $S$ acts as a confounder, influencing both the input context series $X$ and the model's activated knowledge $M$ (shown in red).
  • Figure 2: (a) The intervened Structural Causal Model (SCM) and the overall MultiScale FineTuning (MSFT) framework, which directly models $P(Y|do(X))$. (b) Challenges in directly applying the framework. Left: Downsampling and patching process for constructing multi-scale sequences. Patch tokens at different scales have varying resolution and schematics. Right: Directly applying self-attention over multi-scale embeddings leads to biased cross-scale attention due to misaligned time indices.
  • Figure 3: Complete design of MSFT based on the overall framework in Figure 2(a). (1) Linear adapters are attached to the frozen input projection to learn scale-variant input embeddings. (2) Self-attention layers incorporate scale-specific LoRA and decoupled dependency modeling: (I) in-scale attention employs in-scale masking, ensuring tokens attend only to others within the same scale; (II) cross-scale aggregators progressively fuse tokens across scales in two directions, ensuring correct temporal alignment between tokens. (3) Output projection generates separate predictions for each scale, which are then mixed by up-sampling and learned weights.
  • Figure 4: LSF accuracy w.r.t. number of scales
  • Figure 5: Attention heatmaps of various methods
  • ...and 4 more figures
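
The captions for Figures 2(b) and 3 describe three mechanical steps: building multi-scale sequences by downsampling, restricting attention to tokens of the same scale, and mixing per-scale predictions after up-sampling. The NumPy sketch below illustrates these steps under stated assumptions; it is not the authors' implementation (see the linked repository for that). The function names, the use of average pooling for downsampling, repeat-based up-sampling, and softmax-normalized mixing weights are all illustrative choices not specified in the text above.

```python
from typing import Sequence

import numpy as np


def downsample(x: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a 1-D series over non-overlapping windows (assumed pooling choice)."""
    usable = (len(x) // factor) * factor  # drop any ragged tail
    return x[:usable].reshape(-1, factor).mean(axis=1)


def in_scale_mask(tokens_per_scale: Sequence[int]) -> np.ndarray:
    """Boolean attention mask: True where query and key tokens share a scale."""
    scale_ids = np.repeat(np.arange(len(tokens_per_scale)), tokens_per_scale)
    return scale_ids[:, None] == scale_ids[None, :]


def mix_predictions(preds: Sequence[np.ndarray], logits: np.ndarray, horizon: int) -> np.ndarray:
    """Up-sample each scale's forecast to `horizon` steps, then take a weighted sum.

    Assumes every forecast length divides `horizon`; weights are a softmax over
    `logits`, standing in for the learned mixing weights.
    """
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    upsampled = [np.repeat(p, horizon // len(p))[:horizon] for p in preds]
    return sum(w * u for w, u in zip(weights, upsampled))


# Hypothetical example: a 16-step context, three scales, patch length 4.
series = np.arange(16, dtype=float)
scales = [series, downsample(series, 2), downsample(series, 4)]
patch_len = 4
tokens = [len(s) // patch_len for s in scales]  # patch tokens per scale
mask = in_scale_mask(tokens)                    # block-diagonal over scales

# Placeholder per-scale forecasts for an 8-step horizon at the original scale.
preds = [np.ones(8), 2 * np.ones(4), 3 * np.ones(2)]
mixed = mix_predictions(preds, logits=np.zeros(3), horizon=8)
```

The mask is block-diagonal, so a token from one scale never attends to a token from another; in the design described in Figure 3, cross-scale information flows instead through the separate cross-scale aggregators, which this sketch does not cover.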