
Switch-Hurdle: A MoE Encoder with AR Hurdle Decoder for Intermittent Demand Forecasting

Fabian Muşat, Simona Căbuz

TL;DR

Switch-Hurdle is a new framework that integrates a Mixture-of-Experts (MoE) encoder with a Hurdle-based probabilistic decoder, achieving state-of-the-art prediction performance while maintaining scalability.

Abstract

Intermittent demand, a pattern characterized by long sequences of zero sales punctuated by sporadic non-zero values, poses a persistent challenge in retail and supply chain forecasting. Traditional methods such as ARIMA, exponential smoothing, and Croston variants, as well as modern neural architectures such as DeepAR and Transformer-based models, often underperform on such data, because they either treat demand as a single continuous process or become computationally expensive when scaled across many sparse series. To address these limitations, we introduce Switch-Hurdle: a new framework that integrates a Mixture-of-Experts (MoE) encoder with a Hurdle-based probabilistic decoder. The encoder uses sparse Top-1 expert routing in the forward pass, yet remains approximately dense in the backward pass via a straight-through estimator (STE). The decoder follows a cross-attention autoregressive design with a shared hurdle head that explicitly separates the forecasting task into two components: a binary classification component that estimates the probability of a sale, and a conditional regression component that predicts the quantity given a sale. This structured separation enables the model to capture both the occurrence and the magnitude processes inherent to intermittent demand. Empirical results on the M5 benchmark and a large proprietary retail dataset show that Switch-Hurdle achieves state-of-the-art prediction performance while maintaining scalability.
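The occurrence/magnitude split described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes the positive part is a zero-truncated Negative-Binomial in the mean/dispersion (NB2) form, with variance `mu + alpha * mu**2`, and the function names `nb_log_pmf`, `hurdle_log_pmf`, and `hurdle_mean` are hypothetical. The paper's exact parameterization may differ.

```python
import math


def nb_log_pmf(k, mu, alpha):
    """Log-pmf of a Negative-Binomial with mean mu and dispersion alpha.

    Assumed NB2 parameterization: r = 1 / alpha, variance = mu + alpha * mu**2.
    """
    r = 1.0 / alpha
    log_q = math.log(r) - math.log(r + mu)       # log of "success" probability
    log_1mq = math.log(mu) - math.log(r + mu)    # log of its complement
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * log_q + k * log_1mq)


def hurdle_log_pmf(k, p_pos, mu, alpha):
    """Log-probability of demand k under the assumed hurdle model.

    p_pos is the probability of a sale (k > 0); the quantity given a sale
    follows the Negative-Binomial truncated to k >= 1 (an assumption here).
    """
    if k == 0:
        return math.log1p(-p_pos)
    log_p0 = nb_log_pmf(0, mu, alpha)
    log_trunc = math.log1p(-math.exp(log_p0))    # log(1 - NB(0))
    return math.log(p_pos) + nb_log_pmf(k, mu, alpha) - log_trunc


def hurdle_mean(p_pos, mu, alpha):
    """Expected demand: P(sale) times the mean of the truncated component."""
    p0 = math.exp(nb_log_pmf(0, mu, alpha))
    return p_pos * mu / (1.0 - p0)


# Example: a series that sells on roughly one day in five.
total = sum(math.exp(hurdle_log_pmf(k, 0.2, 3.0, 0.5)) for k in range(500))
point_forecast = hurdle_mean(0.2, 3.0, 0.5)
```

Under these assumptions the negative log of `hurdle_log_pmf` splits into a binary cross-entropy term for occurrence and a truncated count likelihood for magnitude, which is the separation the hurdle head is described as making.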

Paper Structure (20 sections, 12 equations, 5 figures, 7 tables)


Figures (5)

  • Figure 1: Main architecture for the Switch-Hurdle Transformer. The encoder (left) uses Top-1 MoE routing with SwiGLU experts to extract specialized representations of demand and covariate embeddings. The decoder (right) applies cross-attention over the encoder’s context memory to generate step-wise probabilistic forecasts $(p_t^{+}, \mu_t, \alpha_t)$. Each step conditions on the previous prediction, future covariates, and positional embeddings, while the shared hurdle head jointly models zero-demand probability and the conditional Negative-Binomial distribution for positive demand.
  • Figure 2: Overall expert utilization by layer and dataset. Bars show the percentage of tokens routed to each expert for the two Switch-encoder layers (L0, L1) on M5 and internal 1P data after introducing the KL-to-uniform regularizer. The distribution is balanced without collapse while still reflecting dataset- and layer-specific specialization.
  • Figure 3: Qualitative comparison of 28-day forecasts on three representative M5 series: HOUSEHOLD_2_442_TX_1, HOBBIES_1_354_WI_1, and FOODS_2_336_WI_1. For each series we plot the actual daily demand (blue) together with predictions from PatchTST, TFT, and Switch-Hurdle. PatchTST and TFT tend to produce nearly flat forecasts and under-react to intermittent spikes, while Switch-Hurdle better tracks both the occurrence and magnitude of spikes and remains close to zero in no-demand periods.
  • Figure 4: Layer 0 expert specialization conditioned on demand regime. Each bar shows $P(e \mid \text{regime})$; values are normalized to sum to 100% within each regime (Zero, Low, Normal, Spike), which also corrects for minor measurement drift.
  • Figure 5: Layer 1 expert specialization conditioned on demand regime. Normalization as in Figure 4. This pattern supports the claim that Top-1 STE yields sparse forward routing with dense gradient updates, promoting stable yet differentiated experts across layers.
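
The quantities plotted in Figures 2, 4, and 5 can be sketched from a list of Top-1 routing decisions. The snippet below is illustrative only: the function names are hypothetical, the KL direction (usage relative to uniform) is an assumption because the captions do not state it, and a trainable regularizer would presumably operate on soft router probabilities rather than on hard counts as done here.

```python
import math
from collections import Counter, defaultdict


def expert_usage(assignments, num_experts):
    """Fraction of tokens routed to each expert (as in the Figure 2 bars)."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def kl_to_uniform(usage):
    """KL(usage || uniform); zero when balanced, log(n) on full collapse."""
    n = len(usage)
    return sum(u * math.log(u * n) for u in usage if u > 0.0)


def usage_by_regime(assignments, regimes, num_experts):
    """Estimate P(e | regime), normalized to sum to 1 within each regime."""
    grouped = defaultdict(list)
    for expert, regime in zip(assignments, regimes):
        grouped[regime].append(expert)
    return {r: expert_usage(es, num_experts) for r, es in grouped.items()}


# Balanced routing incurs no penalty; a collapsed router incurs the maximum.
balanced = kl_to_uniform(expert_usage([0, 1, 2, 3], 4))
collapsed = kl_to_uniform(expert_usage([0, 0, 0, 0], 4))

# Toy regime labels, using two of the regime names from Figure 4.
by_regime = usage_by_regime([0, 0, 1, 2], ["Zero", "Zero", "Zero", "Spike"], 3)
```

Because each regime is normalized separately, the bars for a rare regime such as Spike remain comparable to those for a common regime such as Zero.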