Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks

Andrei Chernov

TL;DR

This work probes post-evaluation expert contributions in a 16-layer MoE LLM on the MMLU quiz benchmark, focusing on how many experts activate, the shape of gating distributions, and per-expert accuracy. By analyzing gating alphas across 14,042 questions, the paper finds that most experts remain inactive, gating outputs are near uniform (with entropy increasing from early to late layers), and certain experts consistently outperform others. These findings suggest opportunities for pruning inactive experts and reweighting gating to emphasize high-accuracy experts, potentially improving efficiency and performance. However, validation across more models and benchmarks is needed to generalize these insights and assess robustness.

Abstract

Large Language Models (LLMs) with Mixture of Experts (MoE) layers have recently gained significant attention, and current state-of-the-art LLMs use this architecture. A substantial body of research covers how to train such models and how to select their hyperparameters, but few studies analyze the properties of MoE layers after evaluation. In this paper, we take a first step toward closing this gap by evaluating expert contributions on the quiz-based MMLU benchmark. We show that most experts were never activated during inference on this benchmark. Additionally, the output distribution of the gating networks is much closer to uniform than to sparse. Finally, we demonstrate that average performance varies significantly across experts within the same layer.
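
The three quantities named in the abstract (which experts are ever activated, how close the gating output is to uniform, and how accuracy differs across experts) can be illustrated with a minimal sketch. This is not the paper's code. It assumes the gating network emits a probability vector per question and that the top-k experts by gating weight count as activated. The names `gating_entropy`, `top_k_experts`, and `summarize_layer`, and the toy data, are hypothetical.

```python
import math


def gating_entropy(probs):
    """Shannon entropy (in nats) of one gating distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_k_experts(probs, k):
    """Indices of the k largest gating weights (assumed top-k routing)."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def summarize_layer(gate_probs, correct, k):
    """Summarize one MoE layer over a set of questions.

    gate_probs: one gating distribution per question (list of lists)
    correct:    one bool per question, True if the answer was correct
    k:          number of experts activated per question (an assumption)
    """
    num_experts = len(gate_probs[0])
    uses = [0] * num_experts  # how often each expert was activated
    hits = [0] * num_experts  # correct answers while the expert was active
    entropies = []
    for probs, ok in zip(gate_probs, correct):
        entropies.append(gating_entropy(probs))
        for expert in top_k_experts(probs, k):
            uses[expert] += 1
            hits[expert] += int(ok)
    return {
        "inactive": [e for e in range(num_experts) if uses[e] == 0],
        "mean_entropy": sum(entropies) / len(entropies),
        # Entropy of a uniform distribution over all experts (upper bound).
        "max_entropy": math.log(num_experts),
        # Accuracy is only defined for experts activated at least once.
        "accuracy": {e: hits[e] / uses[e]
                     for e in range(num_experts) if uses[e] > 0},
    }


# Toy data: 3 questions, 4 experts, 2 experts activated per question.
gate_probs = [
    [0.40, 0.30, 0.20, 0.10],
    [0.35, 0.15, 0.40, 0.10],
    [0.30, 0.40, 0.20, 0.10],
]
correct = [True, False, True]
summary = summarize_layer(gate_probs, correct, k=2)
print(summary)
```

In this toy run, expert 3 is never selected, so it lands in the `inactive` list and has no accuracy entry. Comparing `mean_entropy` against `max_entropy` gives a rough sense of how close the gating output is to uniform.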

Paper Structure

This paper contains 7 sections, 3 equations, 7 tables.