Filtered not Mixed: Stochastic Filtering-Based Online Gating for Mixture of Large Language Models

Raeid Saqur; Anastasis Kratsios; Florian Krach; Yannick Limmer; Jacob-Junqi Tian; John Willes; Blanka Horvath; Frank Rudzicz

TL;DR

MoE-F introduces an online gating framework for a mixture of expert LLMs in time-series tasks by modeling the best-performing expert as a hidden Markov process and applying the Wonham-Shiryaev stochastic filter to update expert weights in real time. The algorithm runs $N$ parallel filters and then robustly aggregates their predictions using a Gibbs-style softmin, with a second-stage update of the gating intensity matrix $Q$ via a regularized log of a perturbed transition matrix. The authors provide formal optimality guarantees for both the parallel filtering and the robust aggregation steps and demonstrate substantial empirical gains on financial market movement prediction (notably a 17% absolute improvement in F1 over the best single expert) and long-horizon time-series forecasting. The approach offers a principled, online alternative to static MoE routing, enabling dynamic adaptation to regime shifts and heterogeneous expert strengths with practical impact in finance and beyond.
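The robust aggregation step can be illustrated with a small sketch. It assumes each of the $N$ filters proposes a weight vector over the experts and carries a scalar running loss. The names `softmin_weights`, `aggregate`, and the temperature `lam` are hypothetical and not taken from the paper, whose closed-form update may differ in detail.

```python
import numpy as np

def softmin_weights(losses, lam=1.0):
    """Gibbs-style softmin: a lower loss yields a larger weight.

    `lam` is a hypothetical inverse-temperature parameter (assumption).
    """
    losses = np.asarray(losses, dtype=float)
    # Shift by the minimum for numerical stability; the weights are unchanged.
    w = np.exp(-lam * (losses - losses.min()))
    return w / w.sum()

def aggregate(filter_proposals, filter_losses, lam=1.0):
    """Combine N proposed weight vectors (one per row) into a single vector."""
    proposals = np.asarray(filter_proposals, dtype=float)  # shape (N, N)
    alpha = softmin_weights(filter_losses, lam)            # shape (N,)
    # Convex combination of the rows, so the result is again a probability vector.
    return alpha @ proposals

# Illustrative inputs only: three filters, three experts.
proposals = [[0.7, 0.2, 0.1],
             [0.1, 0.8, 0.1],
             [0.3, 0.3, 0.4]]
mixture = aggregate(proposals, filter_losses=[0.2, 0.9, 0.5], lam=2.0)
```

Because the softmin weights are non-negative and sum to one, the aggregated vector stays on the probability simplex whenever each proposal does.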

Abstract

We propose MoE-F, a formalized mechanism for combining $N$ pre-trained Large Language Models (LLMs) for online time-series prediction by adaptively forecasting the best weighting of LLM predictions at every time step. Our mechanism leverages the conditional information in each expert's running performance to forecast the best combination of LLMs for predicting the time series in its next step. Diverging from static (learned) Mixture of Experts (MoE) methods, our approach employs time-adaptive stochastic filtering techniques to combine experts. By framing the expert selection problem as a finite state-space, continuous-time Hidden Markov model (HMM), we can leverage the Wonham-Shiryaev filter. Our approach first constructs $N$ parallel filters corresponding to each of the $N$ individual LLMs. Each filter proposes its best combination of LLMs, given the information it has access to. Subsequently, the $N$ filter outputs are optimally aggregated to maximize their robust predictive power, and this update is computed efficiently via a closed-form expression, generating our ensemble predictor. Our contributions are: **(I)** the MoE-F plug-and-play filtering harness algorithm, **(II)** theoretical optimality guarantees of the proposed filtering-based gating algorithm (via optimality guarantees for its parallel Bayesian filtering and its robust aggregation steps), and **(III)** empirical evaluation and ablative results using state-of-the-art foundational and MoE LLMs on a real-world __Financial Market Movement__ task, where MoE-F attains a 17\% absolute and 48.5\% relative F1 measure improvement over the next best performing individual LLM expert predicting short-horizon market movement based on streaming news. Further, we provide empirical evidence of substantial performance gains in applying MoE-F over specialized models in the long-horizon time-series forecasting domain.
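The absolute and relative improvements quoted above jointly imply approximate F1 scores. The short calculation below makes that arithmetic explicit. The resulting values are inferred from the two rounded percentages, so they are approximations rather than figures reported in this text.

```python
# Figures quoted in the abstract.
absolute_gain = 0.17    # 17 percentage points of F1
relative_gain = 0.485   # 48.5% relative to the best single expert

# relative_gain = absolute_gain / baseline_f1, so solve for the baseline.
baseline_f1 = absolute_gain / relative_gain
moe_f_f1 = baseline_f1 + absolute_gain

print(f"implied best single-expert F1: {baseline_f1:.2f}")
print(f"implied MoE-F F1:              {moe_f_f1:.2f}")
```

Under these assumptions the best single expert would sit near an F1 of 0.35 and MoE-F near 0.52.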

Paper Structure (66 sections, 9 theorems, 84 equations, 9 figures, 9 tables, 2 algorithms)


Introduction
Contributions
Outline
Preliminaries
Helper Functions
The MoE Filtering (MoE-F) Algorithm
The MoE Filtering
Loss functions
Helper Functions
Step 1 -- Optimal Parallel Filtering
Step 2 -- Robust Aggregation
Theoretical Guarantees
Guarantees for Step 1 - Online Parallel Filtering
Guarantees for Step 2 - Robust Aggregation
Experiments
...and 51 more sections

Key Result

Theorem 1

Under Assumption \ref{ass:regularity_SW_filter}, the best a posteriori estimate of the $n^{th}$ expert, $\pi_t^{(n)}$, satisfies an SDE in which $(Q_t)_{i}$ denotes the $i^{th}$ row of the transition matrix $Q_t$ at time $t\ge 0$ and $w_0^i \stackrel{\text{def.}}{=} \mathbb{P}(w_0=e_i)$. The "innovations process" $\overline{W}_{\cdot}^{(n)}$ ...
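For intuition about what such a filter computes, here is a minimal Euler discretization of the classical textbook Wonham filter for a finite-state hidden Markov chain observed in Gaussian noise. It is a sketch under stated assumptions, not the paper's exact SDE. The function name `wonham_euler_step`, the per-state observation drift `h`, and the noise scale `sigma` are hypothetical, and the final clip-and-renormalize step is a numerical safeguard added here.

```python
import numpy as np

def wonham_euler_step(pi, Q, h, dy, dt, sigma=1.0):
    """One Euler step of the classical Wonham filter (textbook form).

    pi    : current posterior over hidden states, shape (N,)
    Q     : generator matrix with rows summing to zero, shape (N, N)
    h     : assumed observation drift in each hidden state, shape (N,)
    dy    : observed increment over the step
    dt    : step size
    sigma : assumed observation noise scale
    """
    pi = np.asarray(pi, dtype=float)
    Q = np.asarray(Q, dtype=float)
    h = np.asarray(h, dtype=float)

    h_bar = float(h @ pi)              # predicted observation drift
    drift = Q.T @ pi * dt              # prior dynamics of the hidden chain
    innovation = dy - h_bar * dt       # surprise in the observation
    correction = pi * (h - h_bar) * innovation / sigma**2

    # Safeguard (not part of the continuous-time filter): keep pi on the simplex.
    new_pi = np.clip(pi + drift + correction, 1e-12, None)
    return new_pi / new_pi.sum()

# Illustrative run: two states, no switching, observation favours state 1.
posterior = wonham_euler_step(
    pi=[0.5, 0.5], Q=np.zeros((2, 2)), h=[0.0, 1.0], dy=0.1, dt=0.1
)
```

An observed increment larger than the predicted drift shifts posterior mass toward states with above-average `h`, which is the mechanism by which a running performance signal can re-weight experts online.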

Figures (9)

Figure 1: A visualization of MoE-F's application as a filtering harness. Seven SOTA LLMs predict market movement direction over three randomly sampled windows of seven (trading) days across varying market regimes: a mixed market with high fluctuations (left), a neutral market (middle), and a bearish market (right). In all sub-plots, the ground-truth (market) trajectory is in black, and the filtered trajectory is in dotted green. The predictions of all other experts (Table \ref{tab:moe-llms-nifty-results}) are overlaid as scatter-plot points. No values are shown for non-trading days.
Figure 1: Statistics of NIFTY test split
Figure 2: MoE-F Mechanism: conceptual depiction of an input signal $x_{\cdot}$ evolving in $\mathbb{R}^d$ with $N$ experts ($\pi$).
Figure 3: Example snapshot of the 'news' component on 2020-02-06, at the onset of the global coronavirus epidemic (the text colors here convey negative and positive sentiments). The prompt of an expert policy $\pi_{LM}$ is composed of a task instruction as prefix, concatenated with the market context and this news value: $x_p \leftarrow (x_{prefix}; x_{context}; x_{news})$.
Figure 4: Heatmap of expert weights and subsequent rankings for the sampled windows in Fig. \ref{fig:trajectory_moe-f}.
...and 4 more figures

Theorems & Definitions (18)

Theorem 1: Optimal Optimistic Prior for $n^{th}$ Expert
Theorem 2: Bi-level Robust Updates to the $Q$-Matrix
Proposition 1: Stability of Perturbations
Theorem 3: Optimal Optimistic Prior for $n^{th}$ Expert - Squared Loss Case
proof
Remark
Theorem 4: Optimal Optimistic Prior for $n^{th}$ Expert - Squared Loss Case
proof
Remark
Proposition 2: Regularity of Perturbations
...and 8 more
