Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics

Xingjian Wu, Zhengyu Li, Hanyin Cheng, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang

TL;DR

This work addresses the need for task-aware modeling in time series analytics by combining a mixture-of-experts (MoE) framework with task-sensitive routing. It introduces PatchMoE, whose Recurrent Noisy Gating router (RNG-Router) uses hierarchical representations across Transformer layers to route time series tokens along both the temporal and channel dimensions, guided by a Temporal & Channel Load Balancing Loss. The approach reports state-of-the-art results on five downstream tasks spanning forecasting, anomaly detection, imputation, and classification on diverse datasets, and ablations indicate that the RNG-Router, the split between shared and routed experts, and the loss terms each contribute substantially. These findings position PatchMoE as a scalable, general backbone for time series analytics with task-aware and interpretable routing behavior.

Abstract

Time series analysis is widely used in real-world applications such as weather forecasting, financial fraud detection, imputation of missing data in IoT systems, and classification for action recognition. Mixture-of-Experts (MoE), although a powerful architecture that has proven effective in NLP, still falls short in adapting to the varied tasks of time series analytics because its router is task-agnostic and it lacks the capability to model channel correlations. In this study, we propose a novel, general MoE-based time series framework called PatchMoE that supports the intricate utilization of "knowledge" for distinct tasks and is thus task-aware. Based on the observation that hierarchical representations often vary across tasks, e.g., forecasting vs. classification, we propose Recurrent Noisy Gating, which uses this hierarchical information in routing to obtain task-specific capability. The routing strategy operates on time series tokens in both the temporal and channel dimensions and is encouraged by a carefully designed Temporal & Channel Load Balancing Loss to model the intricate temporal and channel correlations. Comprehensive experiments on five downstream tasks demonstrate the state-of-the-art performance of PatchMoE.
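The abstract describes the router only at a high level, so the following is a minimal sketch of the general idea rather than the paper's formulation. It combines standard noisy top-k gating with a simple tanh recurrence that carries a routing state from one layer to the next. The class name `RecurrentNoisyRouter`, the weight names, and the recurrence itself are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def softplus(x):
    # log(1 + exp(x)), computed stably.
    return np.logaddexp(0.0, x)


class RecurrentNoisyRouter:
    """Hypothetical router: noisy top-k gating with a state carried across layers."""

    def __init__(self, d_model, num_experts, top_k, rng):
        scale = 1.0 / np.sqrt(d_model)
        self.top_k = top_k
        self.rng = rng
        # Assumed parameterization: a plain tanh recurrence over layers.
        self.W_x = rng.normal(0.0, scale, (d_model, d_model))
        self.W_h = rng.normal(0.0, scale, (d_model, d_model))
        self.W_gate = rng.normal(0.0, scale, (d_model, num_experts))
        self.W_noise = rng.normal(0.0, scale, (d_model, num_experts))

    def __call__(self, x, h_prev, train=True):
        # x, h_prev: (num_tokens, d_model), with num_tokens = channels * patches.
        h = np.tanh(x @ self.W_x + h_prev @ self.W_h)
        logits = h @ self.W_gate
        if train:
            # Input-dependent Gaussian noise, as in standard noisy gating.
            noise_std = softplus(h @ self.W_noise)
            logits = logits + self.rng.standard_normal(logits.shape) * noise_std
        # Keep the top-k logits per token and mask the rest before the softmax.
        top_idx = np.argpartition(-logits, self.top_k - 1, axis=-1)[:, : self.top_k]
        masked = np.full_like(logits, -np.inf)
        top_vals = np.take_along_axis(logits, top_idx, axis=-1)
        np.put_along_axis(masked, top_idx, top_vals, axis=-1)
        return softmax(masked, axis=-1), h


# Usage: 3 channels x 4 patches = 12 tokens, routed through 2 stacked layers.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((3 * 4, 8))
state = np.zeros_like(tokens)
for router in [RecurrentNoisyRouter(8, 6, 2, rng) for _ in range(2)]:
    gates, state = router(tokens, state)
```

Each row of `gates` holds the mixing weights of one token over the experts, with nonzero weight on exactly `top_k` of them. Passing `state` forward is what lets a deeper layer's routing depend on what earlier layers saw.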

Paper Structure

This paper contains 33 sections, 11 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Representation analytics in Forecasting (Weather input-96-predict-336; MSE), Anomaly Detection (SMD; F1-Score), Imputation (Electricity Mask 37.5%; MSE), and Classification (PEMS-SF; Accuracy). For each model, we calculate the CKA similarity between representations from the first and the last layers (vertical axis, shown as columns) and mark the task performance at the top of each column. Stronger models show more distinguishable CKA similarities across different tasks.
  • Figure 2: Overview of PatchMoE. The time series is first normalized and tokenized into time series "tokens". In the $L$ stacked Transformer layers, the tokens are processed by a Multi-head Self-Attention (MSA) mechanism to obtain representations. In the $l$-th layer, the RNG-Router takes $X_{E_l} \in \mathbb{R}^{(N\times n) \times d}$ and the hidden state $h_{l-1}\in\mathbb{R}^{(N\times n) \times d}$ as inputs and uses the task-specific characteristics they carry to route tokens to experts. The Temporal & Channel Load Balancing Loss is designed to encourage the modeling of sparse temporal and channel correlations, which can enhance the temporal semantics and construct better channel strategies between channel independence (CI) and channel dependence (CD). In the illustration of red and green tokens, the green ones are tokens that the Temporal & Channel Load Balancing Loss encourages to be routed to a different group of experts for balance.
  • Figure 3: Model comparison in univariate forecasting. The msMAPE results are averaged over 8,068 univariate time series in TFB (lower is better). Full results are given in a table in the appendix.
  • Figure 4: Model comparison in classification. Accuracy is averaged over 10 subsets of UEA. Full results are given in a table in the appendix.
  • Figure 5: Model performance comparison in five tasks.
  • ...and 2 more figures
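
Figure 1 compares first-layer and last-layer representations using CKA similarity. The caption does not say which CKA variant is used, so the sketch below assumes the common linear form, $\mathrm{CKA}(X, Y) = \|Y^\top X\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$, computed on column-centered feature matrices. The function name `linear_cka` and the toy shapes are illustrative only.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representation matrices.

    x: (num_samples, d1), y: (num_samples, d2); rows must align sample by sample.
    """
    # Center each feature so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Usage with stand-in features for 64 tokens; real inputs would be the
# flattened first-layer and last-layer outputs of a trained model.
rng = np.random.default_rng(0)
first_layer = rng.standard_normal((64, 16))
last_layer = first_layer @ rng.standard_normal((16, 16)) + rng.standard_normal((64, 16))
similarity = linear_cka(first_layer, last_layer)
```

The score lies in [0, 1] and is unchanged by rotating or uniformly rescaling either representation. This makes it a reasonable way to compare layers whose feature axes are not aligned.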