Robust Heterogeneous Analog-Digital Computing for Mixture-of-Experts Models with Theoretical Generalization Guarantees

Mohammed Nowaz Rabbani Chowdhury; Hsinyu Tsai; Geoffrey W. Burr; Kaoutar El Maghraoui; Liu Liu; Meng Wang

TL;DR

This paper proposes a retraining-free heterogeneous computation framework: noise-sensitive experts, which are provably identifiable by their maximum neuron norm, are computed digitally, while the majority of the experts run on analog in-memory computing (AIMC) hardware.

Abstract

Sparse Mixture-of-Experts (MoE) models enable efficient scalability by activating only a small subset of experts per input, yet their massive parameter counts lead to substantial memory and energy inefficiency during inference. Analog in-memory computing (AIMC) offers a promising solution by eliminating frequent data movement between memory and compute units. However, mitigating hardware nonidealities of AIMC typically requires noise-aware retraining, which is infeasible for large MoE models. In this paper, we propose a retraining-free heterogeneous computation framework in which noise-sensitive experts, which are provably identifiable by their maximum neuron norm, are computed digitally while the majority of the experts are executed on AIMC hardware. We further assign densely activated modules, such as attention layers, to digital computation due to their high noise sensitivity despite comprising a small fraction of parameters. Extensive experiments on large MoE language models, including DeepSeekMoE and OLMoE, across multiple benchmark tasks validate the robustness of our approach in maintaining accuracy under analog nonidealities.
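The expert-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the row-per-neuron weight layout, the fixed digital budget `num_digital`, and the choice to rank experts with the largest maximum neuron norm first are all assumptions made for illustration, since the abstract only states that noise-sensitive experts are identifiable by their maximum neuron norm.

```python
import math
from typing import List, Sequence

# Hypothetical layout: one row per neuron of an expert's weight matrix.
Matrix = Sequence[Sequence[float]]


def max_neuron_norm(expert_weights: Matrix) -> float:
    """Largest L2 norm among the rows (neurons) of one expert."""
    return max(math.sqrt(sum(w * w for w in neuron)) for neuron in expert_weights)


def select_digital_experts(experts: Sequence[Matrix], num_digital: int,
                           largest_first: bool = True) -> List[int]:
    """Return the indices of experts to compute digitally.

    Assumption: experts are ranked by maximum neuron norm and the first
    `num_digital` in that ranking are kept digital. The ranking direction
    is not stated in the abstract, so it is exposed as `largest_first`.
    """
    scores = [max_neuron_norm(e) for e in experts]
    order = sorted(range(len(experts)), key=lambda i: scores[i],
                   reverse=largest_first)
    return sorted(order[:num_digital])


# Toy experts with two neurons each (values are made up).
experts = [
    [[0.1, 0.2], [0.3, 0.1]],
    [[3.0, 4.0], [0.0, 0.1]],
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.5, 0.5], [0.2, 0.2]],
]
digital = select_digital_experts(experts, num_digital=2)
analog = [i for i in range(len(experts)) if i not in digital]
print("digital:", digital, "analog:", analog)
```

Because the score depends only on the trained weights, a selection of this kind needs no retraining and no calibration data, which matches the retraining-free framing above.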

Paper Structure

This paper contains 22 sections, 7 theorems, 28 equations, 6 figures, 10 tables.

Introduction
Related Works
Background
The Mixture-of-Experts Architecture
Analog In-Memory Computing of Neural Networks
The Proposed Heterogeneous Computation of MoE
Theoretical Support of Experts Selection
Key Theoretical Insights
Analytical Setup
Theoretical Generalization Guarantees
Experiments
Experimental Setup
Results on DAC-ADC Noise
Results on Weight-programming Noise
Energy Efficiency vs. Throughput vs. Accuracy Tradeoff in Heterogeneous Computation
...and 7 more sections

Key Result

Lemma 4.1

Suppose the model in (th_model) is trained for $T=\Theta(l^2\sqrt{\log l}/\alpha)$ steps. For any $\mathbf{v}\in\{\mathbf{o}_1,\mathbf{o}_2\}$ and any experts $s, s^\prime$ such that $p_{\mathbf{v}}^{(s,T)}=1$ and $p_{-\mathbf{v}}^{(s^\prime,T)}=1$, we have

Figures (6)

Figure 1: a. In heterogeneous computing of MoE, the dense modules and noise-sensitive expert modules are computed in a digital accelerator, while the rest of the experts are computed in an analog accelerator. b. A schematic of the digital and analog accelerators. c. The analog accelerator comprises non-volatile memory (NVM) tiles. Weights are programmed to the crossbar array of each tile.
Figure 2: The heterogeneous computation strategy for MoE models.
Figure 3: Effect of computing dense modules in analog. Accuracy degradation is significant when densely activated modules are executed on AIMC, despite their small parameter footprint.
Figure 4: Performance of different digital expert selection methods in OLMoE.
Figure 5: Performance of different digital expert selection methods in DeepSeekMoE.
...and 1 more figures

Theorems & Definitions (9)

Lemma 4.1
Theorem 4.2
Lemma 4.1: Lemma J.5(ii) of chowdhury2026efficient
Lemma 4.2: Lemma J.6(ii) of chowdhury2026efficient
Lemma 4.3: Lemma I.1(i) of chowdhury2026efficient
Lemma 5.1: Full version of Lemma \ref{lm_m_1}
proof
Theorem 6.1: Full version of Theorem \ref{th_1}
proof
