Switch-Hurdle: A MoE Encoder with AR Hurdle Decoder for Intermittent Demand Forecasting

Fabian Muşat; Simona Căbuz

TL;DR

Switch-Hurdle is a new framework that integrates a Mixture-of-Experts (MoE) encoder with a Hurdle-based probabilistic decoder. On the M5 benchmark and a large proprietary retail dataset, it achieves state-of-the-art prediction performance while maintaining scalability.

Abstract

Intermittent demand, a pattern characterized by long sequences of zero sales punctuated by sporadic non-zero values, poses a persistent challenge in retail and supply chain forecasting. Traditional methods such as ARIMA, exponential smoothing, and Croston variants, as well as modern neural architectures such as DeepAR and Transformer-based models, often underperform on such data because they treat demand as a single continuous process or become computationally expensive when scaled across many sparse series. To address these limitations, we introduce Switch-Hurdle, a new framework that integrates a Mixture-of-Experts (MoE) encoder with a Hurdle-based probabilistic decoder. The encoder uses sparse Top-1 expert routing in the forward pass, while its backward pass is approximately dense via a straight-through estimator (STE). The decoder follows a cross-attention autoregressive design with a shared hurdle head that explicitly separates the forecasting task into two components: a binary classification component that estimates the probability of a sale, and a conditional regression component that predicts the quantity given a sale. This structured separation enables the model to capture both the occurrence and the magnitude processes inherent to intermittent demand. Empirical results on the M5 benchmark and a large proprietary retail dataset show that Switch-Hurdle achieves state-of-the-art prediction performance while maintaining scalability.
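
As a minimal sketch of the hurdle decomposition described in the abstract, the Python snippet below evaluates the log-probability and the mean of a hurdle distribution. It assumes the positive part is a zero-truncated Negative Binomial with mean `mu` and dispersion `alpha` (variance `mu + alpha * mu**2`). That parameterization, the truncation, and all function names are illustrative assumptions, not details confirmed by the paper.

```python
import math


def nb_log_pmf(y, mu, alpha):
    """Log-pmf of a Negative Binomial with mean mu and dispersion alpha.

    Assumed parameterization: r = 1 / alpha, success probability r / (r + mu),
    which gives variance mu + alpha * mu**2.
    """
    r = 1.0 / alpha
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def hurdle_log_prob(y, p_pos, mu, alpha):
    """Log-probability of a count y under the (assumed) hurdle model.

    p_pos is the probability of a sale (occurrence component); the magnitude
    component is a Negative Binomial truncated to y >= 1.
    """
    if y == 0:
        return math.log1p(-p_pos)
    # Renormalize the NB over positive counts by dividing by 1 - P_NB(0).
    log_p0 = nb_log_pmf(0, mu, alpha)
    log_truncation = math.log1p(-math.exp(log_p0))
    return math.log(p_pos) + nb_log_pmf(y, mu, alpha) - log_truncation


def hurdle_mean(p_pos, mu, alpha):
    """Expected demand: P(sale) times the mean of the zero-truncated NB."""
    p0 = math.exp(nb_log_pmf(0, mu, alpha))
    return p_pos * mu / (1.0 - p0)
```

For instance, with `p_pos = 0.3`, `mu = 2.0`, and `alpha = 0.5`, the untruncated Negative Binomial places mass 0.25 at zero, so the expected demand is 0.3 · 2.0 / 0.75 = 0.8. Keeping the occurrence probability separate from the magnitude parameters is what lets the zero mass be fitted independently of the size of positive sales.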

Paper Structure (20 sections, 12 equations, 5 figures, 7 tables)

Introduction
Related Work
Transformers for time series forecasting
Mixture of Experts for time series forecasting
Foundation Models for time series forecasting
Method
Model Architecture
The Switch MoE Encoder
The Autoregressive Hurdle Decoder
Hurdle Head for Demand Distribution
Training Objectives
Probabilistic Objective.
Point-wise Hybrid Objective.
Experiments
Base results
...and 5 more sections

Figures (5)

Figure 1: Main architecture for the Switch-Hurdle Transformer. The encoder (left) uses Top-1 MoE routing with SwiGLU experts to extract specialized representations of demand and covariate embeddings. The decoder (right) applies cross-attention over the encoder’s context memory to generate step-wise probabilistic forecasts $(p_t^{+}, \mu_t, \alpha_t)$. Each step conditions on the previous prediction, future covariates, and positional embeddings, while the shared hurdle head jointly models zero-demand probability and the conditional Negative-Binomial distribution for positive demand.
Figure 2: Overall expert utilization by layer and dataset. Bars show the percentage of tokens routed to each expert for the two Switch-encoder layers (L0, L1) on M5 and internal 1P data after introducing the KL-to-uniform regularizer. The distribution is balanced without collapse while still reflecting dataset- and layer-specific specialization.
Figure 3: Qualitative comparison of 28-day forecasts on three representative M5 series: HOUSEHOLD_2_442_TX_1, HOBBIES_1_354_WI_1, and FOODS_2_336_WI_1. For each series we plot the actual daily demand (blue) together with predictions from PatchTST, TFT, and Switch-Hurdle. PatchTST and TFT tend to produce nearly flat forecasts and under-react to intermittent spikes, while Switch-Hurdle better tracks both the occurrence and magnitude of spikes and remains close to zero in no-demand periods.
Figure 4: Layer 0 expert specialization conditioned on demand regime. Each bar shows $P(e \mid \text{regime})$. Values are normalized to sum to 100% within each regime (Zero, Low, Normal, Spike), which also corrects for minor measurement drift.
Figure 5: Layer 1 expert specialization conditioned on demand regime. Normalization as in Figure 4. This pattern supports the claim that Top-1 STE yields sparse forward routing with dense gradient updates, promoting stable yet differentiated experts across layers.
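
The Top-1 STE routing referenced in the Figure 5 caption can be illustrated with one common straight-through formulation. This is a sketch under assumptions: the router logits $z$, the stop-gradient operator $\operatorname{sg}(\cdot)$, and the expert functions $E_e$ are notation introduced here for illustration, and the paper's exact formulation may differ.

$$
g = \operatorname{softmax}(z), \qquad e^{*} = \arg\max_{e} g_e, \qquad h = \operatorname{onehot}(e^{*}),
$$

$$
\tilde{g} = h + g - \operatorname{sg}(g), \qquad y = \tilde{g}_{e^{*}} \, E_{e^{*}}(x).
$$

In the forward pass $\operatorname{sg}(g) = g$, so $\tilde{g} = h$ and $y = E_{e^{*}}(x)$: only the selected expert is evaluated. In the backward pass $\operatorname{sg}(g)$ and $h$ contribute no gradient, so $\partial \tilde{g}_{e^{*}} / \partial z_e = g_{e^{*}} (\delta_{e e^{*}} - g_e)$, which is generally non-zero for every expert $e$. Under this formulation all router logits receive a gradient signal even though routing itself stays sparse.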
