Swimba: Switch Mamba Model Scales State Space Models

Zhixu Du, Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath, Hai Helen Li, Yiran Chen

TL;DR

Parameter-space MoE can increase SSM capacity while keeping the dominant recurrence cost fixed. The paper also establishes well-definedness and stability for MoE-parameterized SSMs.

Abstract

Mixture-of-experts (MoE) is a common approach for increasing parameter capacity, but applying MoE to state space model (SSM) token mixers can multiply the cost of the recurrent state update. We study how to introduce expert specialization into selective SSMs while preserving computational efficiency. We show that MoE-SSM can refer to two designs: (1) MoE over separated SSMs, which maintains multiple state trajectories and thus scales compute with the number of experts; and (2) MoE-parameterized SSM, which mixes experts in parameter space, maintains a single state trajectory, and evaluates the recurrence once. Our method, Switch Mamba (Swimba), follows the second design by routing over expert-produced SSM streams. Theoretically, we establish well-definedness and stability for MoE-parameterized SSMs and characterize the relationship between the two designs. Empirically, we evaluate Swimba on standard benchmark tasks and measure real-time throughput and latency. Under matched FLOPs, Swimba achieves slightly better average performance than the baseline, with a small slowdown in real-time latency and throughput. Overall, these results suggest that parameter-space MoE can increase SSM capacity while keeping the dominant recurrence cost fixed.
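The contrast between the two designs can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a time-invariant transition `A`, dense routing weights `pi`, a state update of the form `h = A h + B x`, and a readout `y = C^T h`. The function names `separated_moe_ssm` and `parameter_mixed_moe_ssm` are hypothetical and exist only for this illustration.

```python
import numpy as np

def separated_moe_ssm(A, B, C, X, pi):
    """Design (1): one state trajectory per expert; expert outputs are mixed.

    Shapes: A (N, N); B, C (T, E, N, P); X (T, E, P); pi (T, E).
    Returns the outputs (T, P) and the number of state updates performed.
    """
    T, E, N, P = B.shape
    h = np.zeros((E, N))                 # E separate hidden states
    Y = np.zeros((T, P))
    updates = 0
    for t in range(T):
        for e in range(E):
            h[e] = A @ h[e] + B[t, e] @ X[t, e]    # advance expert e's recurrence
            updates += 1
            Y[t] += pi[t, e] * (C[t, e].T @ h[e])  # mix in output space
    return Y, updates

def parameter_mixed_moe_ssm(A, B, C, X, pi):
    """Design (2): mix expert streams first, then run a single recurrence."""
    T, E, N, P = B.shape
    h = np.zeros(N)                      # one shared hidden state
    Y = np.zeros((T, P))
    updates = 0
    for t in range(T):
        B_t = np.einsum("e,enp->np", pi[t], B[t])  # mixed input stream
        C_t = np.einsum("e,enp->np", pi[t], C[t])  # mixed output stream
        X_t = pi[t] @ X[t]                         # mixed token stream, shape (P,)
        h = A @ h + B_t @ X_t                      # single state update per token
        updates += 1
        Y[t] = C_t.T @ h
    return Y, updates

# Toy data (assumed sizes, chosen only for the demonstration).
rng = np.random.default_rng(0)
T, E, N, P = 6, 4, 3, 2
A = 0.5 * np.eye(N)
B = rng.normal(size=(T, E, N, P))
C = rng.normal(size=(T, E, N, P))
X = rng.normal(size=(T, E, P))
pi = np.full((T, E), 1.0 / E)            # uniform dense routing

Y_sep, n_sep = separated_moe_ssm(A, B, C, X, pi)
Y_mix, n_mix = parameter_mixed_moe_ssm(A, B, C, X, pi)
print(n_sep, n_mix)  # 24 6
```

Under these assumptions the separated design performs `T * E` state updates while the parameter-mixed design performs `T`, which is the scaling difference the abstract describes. The two outputs generally differ; they coincide in this sketch when every token routes all weight to the same single expert.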

Paper Structure (28 sections, 5 theorems, 38 equations, 5 figures, 2 tables)

Key Result

Theorem 1

Fix $T,N,P,E$. Assume $A\in\mathbb{R}^{N\times N}$ and, for each $t$ and $e$, streams $B_t^{(e)},C_t^{(e)}\in\mathbb{R}^{N\times P}$ and $X_t^{(e)}\in\mathbb{R}^P$ are given. Let $\pi_t\in\mathbb{R}^E$ satisfy $\pi_{t,e}\ge 0$ and let $\mathcal{K}_t\subseteq[E]$ denote the active set (dense or top-$k$). Then eq:moe_state_method--eq:moe_out_method is exactly a single selective SSM with state size $N$, ...

Figures (5)

  • Figure 1: Compute scaling view for token mixing. The x-axis is sequence length $L$ and the y-axis is per-token computational overhead. Attention scales as $O(L^2)$ while SSMs scale as $O(L)$. Bubble area indicates parameter count, illustrating that MoE can increase parameters with only a small increase in per-token compute, motivating Swimba (MoE-SSM) designs.
  • Figure 2: Two MoE-SSM designs with different state and compute scaling. (a) MoE-parameterized SSM (parameter-space mixing) keeps a single hidden-state trajectory and mixes expert-produced streams before a single recurrence evaluation. (b) MoE over separated SSMs maintains one state trajectory per expert and combines expert outputs, which requires advancing multiple recurrences, so computation scales with the number of experts.
  • Figure 3: Swimba layer: an MoE-parameterized SSM token mixer built from a Mamba-2 block. Each expert applies its own input projections to produce candidate token-dependent streams (e.g., $B_t^{(e)}, C_t^{(e)}, X_t^{(e)}$). At each layer, only a subset of experts is selected per token. The selected experts are combined by a weighted sum to form a single stream for the SSM computation.
  • Figure 4: Decoding throughput on vLLM versus batch size for multiple input-output sequence lengths, with the inset reporting the throughput ratio (Swimba/Nemotron). Swimba-14B remains close to the Nemotron-H-8B baseline in throughput with a slight drop.
  • Figure 5: Latency on vLLM versus input-output sequence length for multiple batch sizes, with the inset reporting the ratio (Swimba/Nemotron). Latency trends are similar across sequence lengths, with Swimba-14B showing slightly higher end-to-end latency than Nemotron-H-8B.
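The per-token expert selection and weighted sum described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the paper's router: the softmax over the selected logits (so that the active weights sum to one) is an assumption, and `topk_routing_weights` and `mix_streams` are hypothetical helper names.

```python
import numpy as np

def topk_routing_weights(logits, k):
    """Map router logits (T, E) to weights (T, E) with k nonzeros per token.

    Assumption: weights are a softmax restricted to the k largest logits.
    """
    # Indices of the k largest logits in each row (order within the k is arbitrary).
    idx = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    # Inactive experts get -inf so they receive exactly zero weight.
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=1, keepdims=True)

def mix_streams(weights, streams):
    """Weighted sum over the expert axis: (T, E) x (T, E, ...) -> (T, ...)."""
    return np.einsum("te,te...->t...", weights, streams)

# Toy sizes (assumed for the demonstration).
rng = np.random.default_rng(0)
T, E, N, P, k = 5, 8, 4, 3, 2
logits = rng.normal(size=(T, E))
B = rng.normal(size=(T, E, N, P))   # candidate streams B_t^{(e)}
X = rng.normal(size=(T, E, P))      # candidate streams X_t^{(e)}

w = topk_routing_weights(logits, k)
B_mixed = mix_streams(w, B)         # one (N, P) stream per token
X_mixed = mix_streams(w, X)         # one (P,) stream per token
print(B_mixed.shape, X_mixed.shape)  # (5, 4, 3) (5, 3)
```

Because the mixing happens before the recurrence, the downstream SSM sees one stream per token regardless of how many experts exist or how many are active.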

Theorems & Definitions (10)

  • Theorem 1: Single-SSM structure under MoE routing
  • Theorem 2: Recurrence complexity does not scale with $E$
  • Theorem 3: BIBO stability under a contractive transition
  • Theorem 4: Separated vs parameter-mixed: equality regime and mismatch bound
  • Theorem 5: Strict expressivity gain with one recurrence
  • proof
  • proof
  • proof
  • proof
  • proof
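The idea behind a BIBO-type guarantee such as Theorem 3 can be checked numerically in a simplified setting. The sketch below is illustrative only and assumes hypotheses that may differ from the paper's: a time-invariant transition `A` with spectral norm `rho < 1`, and a driving input `u_t` (standing in for the mixed term fed into the state update) with norm at most 1. Under those assumptions a geometric-series argument gives `||h_t|| <= 1 / (1 - rho)` for all `t`.

```python
import numpy as np

def state_norms(A, U):
    """Run h_t = A h_{t-1} + u_t from h_0 = 0 and return ||h_t|| for each t."""
    h = np.zeros(A.shape[0])
    norms = []
    for u in U:
        h = A @ h + u
        norms.append(np.linalg.norm(h))
    return np.array(norms)

# Toy setup (all sizes and constants are assumptions for the demonstration).
rng = np.random.default_rng(0)
N, T, rho = 8, 500, 0.9

M = rng.normal(size=(N, N))
A = rho * M / np.linalg.norm(M, 2)      # rescale so the spectral norm equals rho

U = rng.normal(size=(T, N))
U /= np.maximum(1.0, np.linalg.norm(U, axis=1, keepdims=True))  # enforce ||u_t|| <= 1

norms = state_norms(A, U)
bound = 1.0 / (1.0 - rho)               # geometric-series bound on ||h_t||
print(norms.max() <= bound)  # True
```

When routing weights are nonnegative and sum to one, a mixed stream is a convex combination of expert streams, so its norm is bounded by the largest expert norm. That is one reason a bound on the individual experts can carry over to the mixed input in this simplified picture.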