HiMTM: Hierarchical Multi-Scale Masked Time Series Modeling with Self-Distillation for Long-Term Forecasting

Shubao Zhao; Ming Jin; Zhaoxiang Hou; Chengyi Yang; Zengxiang Li; Qingsong Wen; Yi Wang

TL;DR

The paper addresses the multi-scale structure of time series, which existing pre-trained models for long-horizon forecasting often overlook. It proposes HiMTM, a framework that combines a hierarchical multi-scale transformer, a decoupled encoder-decoder, hierarchical self-distillation, and cross-scale attention fine-tuning. Pre-training uses a student-teacher setup with reconstruction and feature-level distillation losses to learn multi-scale representations; fine-tuning then uses cross-scale attention for forecasting. Experiments on seven datasets report state-of-the-art performance in both in-domain and cross-domain settings, and ablations indicate that each component contributes to the result. The work suggests that self-supervised, multi-scale masked time-series modeling can yield robust, transferable forecasts for practical applications such as energy demand forecasting.

Abstract

Time series forecasting is a critical and challenging task in practical applications. Recent advancements in pre-trained foundation models for time series forecasting have gained significant interest. However, current methods often overlook the multi-scale nature of time series, which is essential for accurate forecasting. To address this, we propose HiMTM, a hierarchical multi-scale masked time series modeling framework with self-distillation for long-term forecasting. HiMTM integrates four key components: (1) a hierarchical multi-scale transformer (HMT) to capture temporal information at different scales; (2) a decoupled encoder-decoder (DED) that directs the encoder towards feature extraction while the decoder focuses on pretext tasks; (3) hierarchical self-distillation (HSD) for multi-stage feature-level supervision signals during pre-training; and (4) cross-scale attention fine-tuning (CSA-FT) to capture dependencies between different scales for downstream tasks. These components collectively enhance multi-scale feature extraction in masked time series modeling, improving forecasting accuracy. Extensive experiments on seven mainstream datasets show that HiMTM surpasses state-of-the-art self-supervised and end-to-end learning methods by a considerable margin of 3.16-68.54%. Additionally, HiMTM outperforms the latest robust self-supervised learning method, PatchTST, in cross-domain forecasting by a significant margin of 2.3%. The effectiveness of HiMTM is further demonstrated through its application in natural gas demand forecasting.
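As a rough illustration of the masked time series modeling idea in the abstract, the sketch below splits a series into patches and randomly hides a fraction of them. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names `patchify` and `random_patch_mask`, the non-overlapping patch layout, the patch length of 8, and the masking ratio of 0.5 are all hypothetical choices made for the example.

```python
import numpy as np


def patchify(series, patch_len):
    """Split a 1-D series into non-overlapping patches, dropping any remainder.

    Assumption: non-overlapping patches; the paper may patch differently.
    """
    n_patches = len(series) // patch_len
    return series[: n_patches * patch_len].reshape(n_patches, patch_len)


def random_patch_mask(n_patches, mask_ratio, rng):
    """Return a boolean array in which True marks a masked (hidden) patch."""
    n_masked = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.permutation(n_patches)[:n_masked]] = True
    return mask


rng = np.random.default_rng(0)
series = np.sin(np.linspace(0.0, 8.0 * np.pi, 96))  # toy input series
patches = patchify(series, patch_len=8)             # shape (12, 8)
mask = random_patch_mask(len(patches), mask_ratio=0.5, rng=rng)

visible = patches[~mask]  # patches an encoder would see
masked = patches[mask]    # patches a decoder would be asked to reconstruct
```

In a pre-training loop, only `visible` would be encoded, and the pretext task would be to recover `masked` from the encoder output.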

Paper Structure (19 sections, 11 equations, 9 figures, 4 tables)

Introduction
Related Works
Time Series Forecasting
Time Series Self-supervised Learning
Knowledge Distillation
Method
Overall Architecture
Hierarchical Multi-scale Transformer
Model Pre-training
Model Fine-tuning
Experiments
Experimental Setup
Main Results
Ablation Study
Cross-domain Forecasting
...and 4 more sections

Figures (9)

Figure 1: Illustration of the multi-scale phenomenon on the Electricity dataset.
Figure 2: The overall architecture of HiMTM. Time series data is partitioned into visible and masked parts, which are processed by both the student and teacher encoders. The teacher encoder shares identical network parameters with the student and performs only feedforward operations, without backpropagation ("sg" denotes stop gradient). In the decoder, "q", "k", and "v" represent the query, key, and value components, respectively. $\mathcal{L}_{Ri}$ and $\mathcal{L}_{Di}$ denote the Patch-level Reconstruction Loss and the Feature-level Distillation Loss for each hierarchy $i$, respectively.
Figure 3: Fine-tuning the pre-trained HiMTM.
Figure 4: Component ablation of HiMTM: HMT, DED, HSD, and CSA-FT on ETTh1 and ETTh2.
Figure 5: Forecasting performance with varying masking ratios $M = \{0.1, 0.3, 0.5, 0.7, 0.9\}$ for different prediction horizons.
...and 4 more figures
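The Figure 2 caption names a Patch-level Reconstruction Loss $\mathcal{L}_{Ri}$ and a Feature-level Distillation Loss $\mathcal{L}_{Di}$ for each hierarchy $i$. The sketch below shows one way such per-hierarchy terms could be combined. It rests on assumptions not stated on this page: both losses are taken to be mean squared errors, the terms are summed without weights, and the function names and dictionary keys are hypothetical. NumPy has no autograd, so the stop gradient ("sg") on the teacher is only noted in a comment.

```python
import numpy as np


def reconstruction_loss(pred, target, mask):
    """MSE over masked patches only (assumed form of L_Ri).

    pred, target: arrays of shape (n_patches, patch_len)
    mask: boolean array of shape (n_patches,), True for masked patches
    """
    return float(np.mean((pred[mask] - target[mask]) ** 2))


def distillation_loss(student_feat, teacher_feat):
    """MSE between student and teacher features (assumed form of L_Di).

    In an autodiff framework the teacher features would be detached
    (stop gradient), so only the student receives gradients.
    """
    return float(np.mean((student_feat - teacher_feat) ** 2))


def total_pretraining_loss(levels):
    """Unweighted sum of L_Ri + L_Di over hierarchies (weighting is assumed)."""
    return sum(
        reconstruction_loss(lv["pred"], lv["target"], lv["mask"])
        + distillation_loss(lv["student"], lv["teacher"])
        for lv in levels
    )


# Toy example with two hierarchies.
level0 = {
    "pred": np.zeros((4, 2)),
    "target": np.ones((4, 2)),
    "mask": np.array([True, False, True, False]),
    "student": np.zeros((4, 3)),
    "teacher": 2.0 * np.ones((4, 3)),
}
level1 = {
    "pred": np.ones((2, 4)),
    "target": np.ones((2, 4)),
    "mask": np.array([True, False]),
    "student": np.ones((2, 3)),
    "teacher": np.ones((2, 3)),
}
loss = total_pretraining_loss([level0, level1])
print(loss)  # 5.0 (level 0: 1.0 + 4.0, level 1: 0.0 + 0.0)
```

Restricting the reconstruction term to masked patches keeps the pretext task from being trivially solved by copying visible input.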
