$\texttt{MoE-RBench}$: Towards Building Reliable Language Models with Sparse Mixture-of-Experts

Guanjie Chen; Xinyu Zhao; Tianlong Chen; Yu Cheng

TL;DR

MoE-RBench introduces a comprehensive reliability benchmark for Sparse Mixture-of-Experts (SMoE) models, evaluating safety, hallucination, adversarial robustness, and out-of-distribution (OOD) performance. By analyzing multiple MoE architectures and training/inference strategies, the work shows that, with appropriate router tuning, data augmentation, and decoding techniques, MoE models can match or surpass the reliability of dense LLMs, especially under adversarial and distribution-shift conditions. The study also highlights that routing dynamics, expert dropout, and load-balancing losses substantially influence robustness, offering practical guidance for deploying MoE models in security-sensitive tasks. Overall, MoE-RBench provides actionable insights and datasets to advance trustworthy MoE-based language models in real-world settings.

Abstract

Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, the reliability assessment of MoE lags behind its surging applications. Moreover, when transferred to new domains, such as in fine-tuning, MoE models sometimes underperform their dense counterparts. Motivated by this research gap and counter-intuitive phenomenon, we propose $\texttt{MoE-RBench}$, the first comprehensive assessment of SMoE reliability from three aspects: $\textit{(i)}$ safety and hallucination, $\textit{(ii)}$ resilience to adversarial attacks, and $\textit{(iii)}$ out-of-distribution robustness. Extensive models and datasets are tested to compare MoE to dense networks along these reliability dimensions. Our empirical observations suggest that, with appropriate hyperparameters, training recipes, and inference techniques, we can build MoE models that are more reliable than dense LLMs. In particular, we find that the robustness of SMoE is sensitive to the basic training settings. We hope that this study can provide deeper insights into how to adapt pre-trained MoE models to other tasks with higher generation security, quality, and stability. Code is available at https://github.com/UNITES-Lab/MoE-RBench

Paper Structure (32 sections, 2 equations, 7 figures, 11 tables)

Introduction
Related Works
Sparse Mixture-of-Experts (SMoE).
Reliability Evaluation of LLMs.
Preliminary
Sparse Mixture of Experts
MoE Model Architectures
MoE-RBench: how reliable is the MoE?
Safety and Hallucination Evaluation
Evaluation Datasets and Metrics
Implementation Details
Evaluation Results
Adversarial Robustness Evaluation
Evaluation Datasets and Metrics
Implementation Details
...and 17 more sections

Figures (7)

Figure 1: Overall reliability evaluation of sparse neural networks. Left: an overview of the MoE-RBench dimensions. Right: the full-scale performance (%) of the MoE model MoLM-350M-K2 compared to pythia-410M, its dense counterpart with a similar architecture and activated parameter size; outer circles indicate superior performance. The metrics in the right figures are defined as follows: Clean and Adversarial Accuracy (Acc.) are measured on SNLI; OOD Accuracy (Acc.) is the average performance on SST-2 over all OOD transformations; Harmlessness is $1$ minus the average OpenAI Moderation score over all safety datasets; TruthfulQA MC is the average of all multiple-choice metrics on TruthfulQA; and Natural Questions is the Exact Match ratio on NQ.
Figure 2: The mean harmfulness score of MoLM-350M-K2 and LlamaMoE-3B-K2 for each dataset, calculated by the Reward Model, Llama Guard, and OpenAI Content Moderation API. Lower scores indicate less harmful (safer) responses. Colors distinguish the model families: pythia, MoLM, OpenLlama, and LlamaMoE.
Figure 3: The routing difference between in-domain and OOD datasets for MoLM-350M-K2. At each layer, we compute the L1 distance between the routers of the same model when it receives in-domain and OOD samples. The reported values are averaged over the word-level and sentence-level benchmarks. Lighter colors indicate larger routing differences.
Figure 4: The mean harmfulness score of MoLM-350M-K2 and LlamaMoE-3B-K2 for each dataset when training data is mixed with safety samples, calculated by the Reward Model, Llama Guard, and OpenAI Content Moderation API. Lower scores indicate less harmful (safer) responses. The numbers in front of the bars give the decrease in harmfulness score relative to training without safety samples; a larger decrease indicates a greater improvement. Colors distinguish the model families: pythia, MoLM, OpenLlama, and LlamaMoE.
Figure 5: The mean harmfulness score of MoLM and LlamaMoE model families for each dataset mixed with safety samples, calculated by the Reward Model, Llama Guard, and OpenAI Content Moderation API. Lower scores indicate less harmful (safer) responses. Colors distinguish the model families: pythia, MoLM, OpenLlama, and LlamaMoE.
...and 2 more figures
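
Two of the quantities named in the captions above are simple enough to sketch in code: the Harmlessness metric of Figure 1 (one minus the mean moderation score) and the per-layer L1 routing distance of Figure 3. The following is a minimal illustration, not the authors' implementation. It assumes that moderation scores lie in [0, 1] and that a router is summarized by the fraction of routing decisions sent to each expert; the captions do not specify how the routers are represented, so that summary is a guess. All function names and the input layout are hypothetical.

```python
from typing import List, Sequence

# Hypothetical layout: routing[layer][token] is the list of top-k expert ids
# chosen for that token at that layer.
Routing = List[List[List[int]]]


def harmlessness(moderation_scores: Sequence[float]) -> float:
    """One minus the mean moderation score (scores assumed to be in [0, 1])."""
    if not moderation_scores:
        raise ValueError("need at least one moderation score")
    return 1.0 - sum(moderation_scores) / len(moderation_scores)


def expert_distribution(layer_routing: List[List[int]], num_experts: int) -> List[float]:
    """Fraction of routing decisions sent to each expert in a single layer."""
    counts = [0] * num_experts
    for token_experts in layer_routing:
        for expert_id in token_experts:
            counts[expert_id] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("layer has no routing decisions")
    return [c / total for c in counts]


def layerwise_l1(in_domain: Routing, ood: Routing, num_experts: int) -> List[float]:
    """Per-layer L1 distance between in-domain and OOD expert distributions."""
    if len(in_domain) != len(ood):
        raise ValueError("both inputs must cover the same layers")
    distances = []
    for layer_in, layer_ood in zip(in_domain, ood):
        p = expert_distribution(layer_in, num_experts)
        q = expert_distribution(layer_ood, num_experts)
        distances.append(sum(abs(a - b) for a, b in zip(p, q)))
    return distances


if __name__ == "__main__":
    # Two layers, two tokens each, top-2 routing over four experts.
    in_domain = [[[0, 1], [0, 1]], [[0, 1], [2, 3]]]
    ood = [[[2, 3], [2, 3]], [[0, 1], [2, 3]]]
    print(layerwise_l1(in_domain, ood, num_experts=4))  # [2.0, 0.0]
```

Under this assumed definition the distance ranges from 0 (identical expert usage) to 2 (completely disjoint expert usage), which gives a bounded scale for comparing layers.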
