Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting

Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen

TL;DR

Ada-MoGE addresses frequency coverage imbalance in time-series MoEs by combining adaptive Gaussian band-pass feature decoupling with a dual-feature gating mechanism that uses spectral intensity and cross-variable frequency response to determine the number of active experts. The model adaptively allocates experts to dominant frequency bands and suppresses noise, yielding improved accuracy across six benchmarks with only 0.2M parameters. Extensive ablations demonstrate the benefits of Gaussian feature decoupling, adaptive expert budgeting, and the importance of balanced expert counts, layers, and feature dimensionality. Overall, Ada-MoGE delivers state-of-the-art results with high efficiency, highlighting the value of frequency-aware routing in time-series forecasting.

Abstract

Multivariate time series forecasting is widely used in domains such as industry, transportation, and finance. However, the dominant frequencies in a time series may shift as the spectral distribution of the data evolves. Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in a frequency coverage imbalance. Specifically, too few experts can overlook critical information, while too many can introduce noise. To this end, we propose Ada-MoGE, an adaptive Gaussian Mixture of Experts model. Ada-MoGE integrates spectral intensity and frequency response to adaptively determine the number of experts, ensuring alignment with the frequency distribution of the input data. This approach prevents both information loss from too few experts and noise contamination from too many. Additionally, to avoid the noise introduced by direct band truncation, we employ Gaussian band-pass filtering to smoothly decompose the frequency-domain features, further improving the feature representation. Experimental results show that our model achieves state-of-the-art performance on six public benchmarks with only 0.2 million parameters.
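
The contrast the abstract draws between direct band truncation and Gaussian band-pass filtering can be illustrated with a small sketch. This is not the authors' implementation: the evenly spaced band centers, the `sigma_scale` width parameter, the normalization that makes the masks sum to one at every frequency bin, and the function names are all assumptions chosen for illustration. A naive DFT is used so the example needs only the standard library.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def gaussian_masks(n, num_bands, sigma_scale=0.5):
    """Build one smooth mask per band over the n DFT bins.

    Assumptions (not from the paper): band centers are evenly spaced over
    [0, n // 2], each Gaussian has standard deviation sigma_scale * band width,
    and the masks are normalized to sum to 1 at every bin.
    """
    half = n // 2
    width = half / num_bands
    sigma = sigma_scale * width
    centers = [(i + 0.5) * width for i in range(num_bands)]
    masks = [[0.0] * n for _ in range(num_bands)]
    for k in range(n):
        f = min(k, n - k)  # symmetric in k so the filtered signal stays real
        raw = [math.exp(-((f - c) ** 2) / (2 * sigma ** 2)) for c in centers]
        total = sum(raw)
        for i in range(num_bands):
            masks[i][k] = raw[i] / total
    return masks


def gaussian_band_decompose(x, num_bands, sigma_scale=0.5):
    """Split x into num_bands time-domain components, one per Gaussian band."""
    spectrum = dft(x)
    masks = gaussian_masks(len(x), num_bands, sigma_scale)
    return [
        idft_real([m * s for m, s in zip(mask, spectrum)])
        for mask in masks
    ]
```

Because the normalized masks sum to one at every bin, the band components add back up to the original signal, and each frequency is shared smoothly between neighboring bands instead of being cut at a hard boundary. In a mixture-of-experts setting, each component would be the input to one expert.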

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Performance comparison of Ada-MoGE with other state-of-the-art models. Panel (a) is an MSE-based radar chart showing that Ada-MoGE achieves advanced performance on six public benchmarks. Panel (b) compares the parameters and FLOPs of Ada-MoGE with those of other state-of-the-art models. Ada-MoGE has only 0.2M parameters, and its FLOPs are significantly lower than those of existing models. Its MAE on ETTh1 is also significantly lower than that of other models.
  • Figure 2: Overview of the Ada-MoGE method. Ada-MoGE first applies Gaussian band-pass filtering to decouple the frequency-domain features, and different experts process the features of different frequency bands. To capture the dominant frequency bands and filter out noisy ones, an adaptive learner based on two-dimensional features is designed to learn the number of dominant experts.
  • Figure 3: Comparison of 96-step forecasts by FreqMoE, TimeMixer, and Ada-MoGE on the ETTm2 dataset. Ground truth (blue) versus forecasts (orange).
  • Figure 4: Performance comparison of Ada-MoGE versus FreqMoE modules.
  • Figure 5: Hyperparameter sensitivity analysis of Ada-MoGE on ETT datasets.
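
The idea in the Figure 2 caption, that the number of dominant experts should follow the input spectrum, can be illustrated with a deliberately simple stand-in. The sketch below is not the paper's adaptive learner, which is learned and also uses frequency response. It is a fixed cumulative-energy heuristic that uses spectral intensity only, and the function name and the `coverage` threshold are hypothetical.

```python
def select_active_experts(band_energy, coverage=0.8):
    """Pick the fewest bands whose combined energy reaches `coverage` of the total.

    band_energy: one non-negative energy value per frequency band (per expert).
    Returns the sorted indices of the bands, and hence experts, to keep active.
    This is a hand-written heuristic for illustration, not a learned gate.
    """
    total = sum(band_energy)
    if total <= 0:
        return []
    # Visit bands from strongest to weakest; ties keep their original order.
    order = sorted(range(len(band_energy)), key=lambda i: band_energy[i], reverse=True)
    active, covered = [], 0.0
    for i in order:
        active.append(i)
        covered += band_energy[i]
        if covered >= coverage * total:
            break
    return sorted(active)
```

With a spectrum concentrated in one band, such as energies `[9.0, 0.5, 0.3, 0.2]`, only the first expert is kept. With a flat spectrum such as `[1.0, 1.0, 1.0, 1.0]`, all four are needed to reach the same coverage. This mirrors the trade-off the paper describes: too few experts miss information, while too many admit noise.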