TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation

Juntong Ni, Zewen Liu, Shiyu Wang, Ming Jin, Wei Jin

TL;DR

TimeDistill tackles the efficiency gap in long-term time-series forecasting by transferring multi-scale temporal and multi-period frequency knowledge from heavy Transformer/CNN teachers to a lightweight MLP via cross-architecture knowledge distillation. It jointly distills predictions and intermediate features at multiple temporal scales and frequency bands, with a theoretical interpretation as mixup-based data augmentation. Empirically, TimeDistill yields up to 18.6% gains for the MLP, can surpass teachers on eight datasets, and achieves up to 7x faster inference with 130x fewer parameters. The work demonstrates the practicality and versatility of distilling cross-architecture knowledge for efficient forecasting across diverse datasets and architectures.
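
The two kinds of knowledge named above can be sketched as loss terms on a single univariate forecast. This is only an illustrative sketch, not the authors' implementation: it assumes average pooling for the temporal scales, a temperature softmax over DFT amplitudes compared with a KL divergence for the frequency side, and hypothetical weights `alpha` and `beta`. It covers prediction-level matching only (intermediate features are omitted), and all function names are made up for this example. A practical version would use a tensor library instead of plain lists.

```python
import cmath
import math


def downsample(series, factor):
    """Average-pool a 1-D series over non-overlapping windows of size `factor`."""
    usable = len(series) - len(series) % factor
    return [sum(series[i:i + factor]) / factor for i in range(0, usable, factor)]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_scale_loss(student, teacher, factors=(1, 2, 4)):
    """Mean MSE between student and teacher at several temporal resolutions."""
    losses = [mse(downsample(student, f), downsample(teacher, f)) for f in factors]
    return sum(losses) / len(losses)


def amplitude_spectrum(series):
    """Amplitudes of the positive (non-DC) frequencies, via a naive DFT."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series)))
        for k in range(1, n // 2 + 1)
    ]


def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def multi_period_loss(student, teacher, temperature=1.0, eps=1e-12):
    """KL(teacher || student) between softmax-normalised amplitude spectra."""
    p = softmax(amplitude_spectrum(teacher), temperature)
    q = softmax(amplitude_spectrum(student), temperature)
    # eps guards against division by zero when a probability underflows.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student, teacher, target, alpha=0.5, beta=0.5):
    """Supervised MSE plus weighted multi-scale and multi-period terms (assumed weights)."""
    return (mse(student, target)
            + alpha * multi_scale_loss(student, teacher)
            + beta * multi_period_loss(student, teacher))
```

Under these assumptions, a student that reproduces the teacher exactly incurs zero distillation penalty, while a student whose dominant period differs from the teacher's is penalised by the frequency term even when both series have the same amplitude.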

Abstract

Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating a lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals that different models capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to an MLP. Additionally, we provide a theoretical analysis demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7x faster inference and requires 130x fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.
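
The mixup interpretation can be illustrated in the simplest possible case. The identity below is a sketch under stated assumptions, not the paper's theorem: it assumes a squared-error loss, a student prediction $\hat{y}$ for input $x$, ground truth $y$, teacher prediction $y^t$, and a hypothetical weight $\lambda \in [0, 1]$ that balances the supervised term against the distillation term.

```latex
\lambda \,\lVert \hat{y} - y \rVert^2 + (1 - \lambda)\,\lVert \hat{y} - y^t \rVert^2
  = \bigl\lVert \hat{y} - \bigl(\lambda y + (1 - \lambda)\, y^t\bigr) \bigr\rVert^2
  + \lambda (1 - \lambda)\,\lVert y - y^t \rVert^2
```

The last term does not depend on the student, so in this setting minimizing the combined loss amounts to regressing onto the mixed target $\lambda y + (1 - \lambda)\, y^t$, that is, a mixup of the ground-truth and teacher labels for the same input $x$.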

Paper Structure

This paper contains 47 sections, 4 theorems, 20 equations, 22 figures, and 27 tables.

Key Result

Theorem 1

Let $(x, y)$ denote original input data pairs and $(x, y^t)$ represent corresponding teacher data pairs. Consider a data augmentation function $\mathcal{A}(\cdot)$ applied to $(x, y)$, generating augmented samples $(x', y')$. Define the training loss on these augmented samples as $\mathcal{L}_{aug}$ ...

Figures (22)

  • Figure 1: Performance comparison.
  • Figure 2: Model efficiency comparison averaged across all prediction lengths (96, 192, 336, 720) for the ECL dataset. Full results on more datasets are listed in the appendix.
  • Figure 3: Win ratio (%) of MLP vs. teacher models across datasets under the input-720-predict-96 setting. The win ratio is generally large (average: 49.92%, median: 49.96%), indicating that MLP and teacher models excel on different samples with minimal overlap.
  • Figure 4: Visualization of model predictions on different downsampled scales of the ECL dataset. MLP consistently shows poor performance at multiple scales, while other models perform well, highlighting the importance of capturing multi-scale patterns.
  • Figure 5: Prediction spectrograms of various models on the ECL dataset against the ground truth. MLP fails to match the amplitudes of several main frequencies in the ground truth, with red numbers indicating amplitude differences for the most significant frequency.
  • ...and 17 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4