Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics
Xingjian Wu, Zhengyu Li, Hanyin Cheng, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang
TL;DR
This work addresses the need for task-aware modeling in time series analytics by integrating a mixture-of-experts (MoE) framework with task-sensitive routing. It introduces PatchMoE, which features a Recurrent Noisy Gating mechanism (RNG-Router) that leverages hierarchical representations across Transformer layers and routes time series tokens across both temporal and channel dimensions, guided by a Temporal & Channel Load Balancing Loss. The approach achieves state-of-the-art results on five downstream tasks spanning forecasting, anomaly detection, imputation, and classification across diverse datasets, and ablations confirm the substantial contributions of the RNG-Router, the split between shared and routed experts, and the loss terms. These findings suggest PatchMoE as a scalable, general backbone for time series analytics with strong task-awareness and interpretable routing behavior, with practical implications for real-world data processing and analytics pipelines.
Abstract
Time series analysis is widely used in real-world applications such as weather forecasting, financial fraud detection, imputation of missing data in IoT systems, and classification for action recognition. Mixture-of-Experts (MoE), although a powerful architecture that has proven effective in NLP, still falls short in adapting to the versatile tasks of time series analytics because its router is task-agnostic and it lacks the capability to model channel correlations. In this study, we propose a novel, general MoE-based time series framework called PatchMoE that supports the intricate "knowledge" utilization required by distinct tasks and is thus task-aware. Based on the observation that hierarchical representations often vary across tasks, e.g., forecasting vs. classification, we propose a Recurrent Noisy Gating that utilizes this hierarchical information in routing, thereby obtaining task-specific capability. The routing strategy operates on time series tokens in both the temporal and channel dimensions, and is encouraged by a meticulously designed Temporal & Channel Load Balancing Loss to model the intricate temporal and channel correlations. Comprehensive experiments on five downstream tasks demonstrate the state-of-the-art performance of PatchMoE.
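To make the routing idea concrete, the following is a minimal sketch of generic noisy top-k gating applied to patch tokens laid out over channels and time, together with a simple balance penalty computed along each of the two axes. It is an illustration under stated assumptions, not the PatchMoE implementation: the abstract does not give the formulas, so the gating follows the common noisy top-k recipe, the recurrent coupling across Transformer layers is omitted, and the names noisy_topk_gate, cv_squared, and balance_loss as well as all shapes and weights are hypothetical.

```python
import numpy as np


def noisy_topk_gate(tokens, w_gate, w_noise, k, rng):
    """Generic noisy top-k gating (an assumed stand-in for the paper's router).

    tokens: (n_tokens, d_model); w_gate, w_noise: (d_model, n_experts).
    Returns gates of shape (n_tokens, n_experts) that are nonzero only on
    the top-k experts of each token and sum to 1 per token.
    """
    clean = tokens @ w_gate
    # Softplus keeps the learned noise scale positive.
    noise_std = np.log1p(np.exp(tokens @ w_noise))
    logits = clean + rng.standard_normal(clean.shape) * noise_std

    # Keep the k largest logits per token, mask the rest with -inf.
    topk_idx = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    masked = np.full_like(logits, -np.inf)
    topk_val = np.take_along_axis(logits, topk_idx, axis=1)
    np.put_along_axis(masked, topk_idx, topk_val, axis=1)

    # Numerically stable softmax over the surviving logits.
    masked -= masked.max(axis=1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=1, keepdims=True)


def cv_squared(x, eps=1e-10):
    """Squared coefficient of variation; 0 when the load is perfectly even."""
    return x.var() / (x.mean() ** 2 + eps)


def balance_loss(gates, n_channels, n_patches):
    """Hypothetical two-axis balance penalty (not the paper's exact loss).

    gates: (n_channels * n_patches, n_experts), tokens ordered channel-major.
    """
    g = gates.reshape(n_channels, n_patches, -1)
    per_channel = g.sum(axis=1)  # (n_channels, n_experts): load within each channel
    per_patch = g.sum(axis=0)    # (n_patches, n_experts): load within each time patch
    per_channel_term = np.mean([cv_squared(row) for row in per_channel])
    per_patch_term = np.mean([cv_squared(row) for row in per_patch])
    return per_channel_term + per_patch_term


# Usage with made-up sizes and random weights.
rng = np.random.default_rng(0)
n_channels, n_patches, d_model, n_experts, k = 3, 8, 16, 4, 2
tokens = rng.standard_normal((n_channels * n_patches, d_model))
w_gate = rng.standard_normal((d_model, n_experts)) * 0.1
w_noise = rng.standard_normal((d_model, n_experts)) * 0.1

gates = noisy_topk_gate(tokens, w_gate, w_noise, k, rng)
loss = balance_loss(gates, n_channels, n_patches)
```

In this sketch each token activates k of the n_experts experts, and the penalty grows when some experts receive most of the gate mass within a channel or within a time patch. A router that reuses information from earlier layers, as the abstract describes, would replace the single linear gate used here.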
