HiMTM: Hierarchical Multi-Scale Masked Time Series Modeling with Self-Distillation for Long-Term Forecasting
Shubao Zhao, Ming Jin, Zhaoxiang Hou, Chengyi Yang, Zengxiang Li, Qingsong Wen, Yi Wang
TL;DR
The paper tackles the challenge of capturing multi-scale information in long-horizon time-series forecasting and proposes HiMTM, a framework that combines a hierarchical multi-scale transformer, a decoupled encoder-decoder, hierarchical self-distillation, and cross-scale attention fine-tuning. Pre-training uses a student-teacher setup with reconstruction and feature-level distillation losses to learn multi-scale representations; fine-tuning then applies cross-scale attention for forecasting. Experiments on seven datasets show state-of-the-art performance in both in-domain and cross-domain settings, and ablations support the contribution of each component. The work shows that self-supervised, multi-scale, masked time-series modeling can yield robust, transferable forecasts for practical applications such as natural gas demand forecasting.
Abstract
Time series forecasting is a critical and challenging task in practical applications. Recent advances in pre-trained foundation models for time series forecasting have attracted significant interest. However, current methods often overlook the multi-scale nature of time series, which is essential for accurate forecasting. To address this, we propose HiMTM, a hierarchical multi-scale masked time series modeling framework with self-distillation for long-term forecasting. HiMTM integrates four key components: (1) a hierarchical multi-scale transformer (HMT) to capture temporal information at different scales; (2) a decoupled encoder-decoder (DED) that directs the encoder towards feature extraction while the decoder focuses on pretext tasks; (3) hierarchical self-distillation (HSD) for multi-stage feature-level supervision signals during pre-training; and (4) cross-scale attention fine-tuning (CSA-FT) to capture dependencies between different scales for downstream tasks. Together, these components enhance multi-scale feature extraction in masked time series modeling and improve forecasting accuracy. Extensive experiments on seven mainstream datasets show that HiMTM surpasses state-of-the-art self-supervised and end-to-end learning methods by a considerable margin of 3.16-68.54%. Additionally, HiMTM outperforms PatchTST, the latest robust self-supervised learning method, in cross-domain forecasting by a significant margin of 2.3%. The effectiveness of HiMTM is further demonstrated through its application in natural gas demand forecasting.
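To make the idea of multi-scale masked modeling more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It cuts one series into non-overlapping patches at several scales, hides a random subset of patches, and scores a reconstruction only on the hidden patches. The patch lengths, the mask ratio, the function names, and the weighted sum used to combine the reconstruction and feature-level distillation terms are all assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np


def patchify(series, patch_len):
    """Split a 1-D series into non-overlapping patches (trailing remainder dropped)."""
    n_patches = len(series) // patch_len
    return series[: n_patches * patch_len].reshape(n_patches, patch_len)


def multi_scale_views(series, patch_lens):
    """Return one patched view of the series per scale (hypothetical helper)."""
    return {p: patchify(series, p) for p in patch_lens}


def random_mask(n_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a patch hidden from the encoder."""
    # Always hide at least one patch so the masked loss is well defined.
    n_masked = max(1, int(round(n_patches * mask_ratio)))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed on the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


def feature_distillation_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher features (assumed form)."""
    return float(np.mean((student_feat - teacher_feat) ** 2))


def combined_loss(rec_loss, distill_loss, distill_weight=1.0):
    """Weighted sum of both terms; the weighting scheme is an assumption."""
    return rec_loss + distill_weight * distill_loss


# Toy usage on a synthetic sine wave of length 96.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0.0, 8.0 * np.pi, 96))

views = multi_scale_views(series, patch_lens=(8, 16, 32))
shapes = {p: v.shape for p, v in views.items()}

patches = views[8]                                  # finest scale: 12 patches
mask = random_mask(len(patches), mask_ratio=0.5, rng=rng)

# A trivial "model" that predicts zeros, used only to exercise the loss.
rec = masked_reconstruction_loss(np.zeros_like(patches), patches, mask)
```

In a full pipeline the zero predictor would be replaced by an encoder-decoder, and the features passed to the distillation term would come from a student network and a teacher network at each stage of the hierarchy.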
