
MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models

Dohwan Ko, Jinyoung Park, Seoung Choi, Sanghyeok Lee, Seohyun Lee, Hyunwoo J. Kim

Abstract

Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook better expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce modality-aware router guidance, which enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling task-level expert specialization.
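The modality-aware router guidance described above amounts to restricting the router's exploration space per modality. Below is a minimal PyTorch sketch of this idea; the function name, the activation-frequency statistic, and the threshold `tau` are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def modality_aware_guidance(gate_logits: torch.Tensor,
                            expert_freq: torch.Tensor,
                            tau: float = 0.01) -> torch.Tensor:
    """Discourage exploration of experts rarely used for the current modality.

    gate_logits: (num_tokens, num_experts) router logits for one MoE layer.
    expert_freq: (num_experts,) activation frequency of each expert for the
                 input's modality (hypothetical statistic, e.g. estimated from
                 the pretrained router on held-out data of that modality).
    tau:         frequency threshold below which an expert is masked out.
    """
    infrequent = expert_freq < tau  # (num_experts,) boolean mask
    # Setting logits to -inf gives these experts zero sampling probability,
    # so rollouts never waste exploration on them.
    return gate_logits.masked_fill(infrequent, float("-inf"))
```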

Paper Structure

This paper contains 11 sections, 7 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison between top-$K$ routing and our MoE-GRPO. (a) While the top-$K$ routing deterministically selects $K$ experts based on gating scores, (b) MoE-GRPO stochastically samples $K$ experts across multiple rollouts and optimizes the expert selection policy through reward-based feedback.
  • Figure 2: Overall pipeline of MoE-GRPO. Given an input image (or video) and a question, denoted as $\boldsymbol{x}$, the rollout module $g_\text{old}$ samples $G$ expert routing policies, i.e., $\{\boldsymbol{E}^i\}_{i=1}^G \sim g_\text{old}(\boldsymbol{E}|\boldsymbol{x})$, where each policy $\boldsymbol{E}^i$ represents a sequence of expert selections across layers. Under each rollout $\boldsymbol{E}^i$, the model generates an output token sequence $\boldsymbol{y}^i$, and a corresponding reward $R^i$ is computed by the reward function. The relative reward of each rollout is evaluated within its group to derive the advantage value $\hat{A}^i$, which guides the policy update toward higher-reward expert combinations. To jointly optimize token-level generation and layer-wise expert routing, the overall training objective of MoE-GRPO consists of two sub-objectives: Token-GRPO, which optimizes token-level generation quality, and Gate-GRPO, which refines layer-wise expert selection through the gating network. A code sketch of this rollout-and-advantage loop follows the figure list.
  • Figure 3: Training curves. (a) and (b) present the mean and standard deviation of the accuracy reward of MoE-GRPO, comparing our modality-aware router guidance with the modality-agnostic (multi.) expert selection baseline.
  • Figure 4: Token-level expert utilization ratio. Under MoE-GRPO, expert activation is more evenly distributed across the token sequence, resulting in more balanced expert utilization.
  • Figure 5: Expert utilization ratio (x-axis) for each task (y-axis). MoE-GRPO enhances task-level expert specialization by inducing more diverse expert activation patterns across tasks.
  • ...and 1 more figure
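To make the rollout-and-advantage loop of Figure 2 concrete, here is a minimal PyTorch sketch of the two operations it hinges on: stochastic expert sampling (replacing deterministic top-$K$ selection) and the group-relative advantage used by GRPO. Function names and the generation/reward plumbing in the usage comment are hypothetical placeholders, not the authors' implementation:

```python
import torch

def sample_experts(gate_logits: torch.Tensor, k: int) -> torch.Tensor:
    """One rollout of stochastic expert selection for a single MoE layer:
    sample K distinct experts per token from the softmax over gating scores,
    rather than deterministically taking the top-K."""
    probs = torch.softmax(gate_logits, dim=-1)      # (num_tokens, num_experts)
    # multinomial without replacement draws K distinct expert indices per token.
    return torch.multinomial(probs, num_samples=k)  # (num_tokens, k)

def group_relative_advantage(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: each rollout's reward is normalized against the
    mean and standard deviation of its own group of G rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical usage for one training example with G rollouts:
#   rewards = []
#   for i in range(G):
#       experts_i = sample_experts(gate_logits, k)   # E^i in Figure 2
#       y_i = generate_with(experts_i)               # placeholder decoder call
#       rewards.append(reward_fn(y_i))               # R^i
#   advantages = group_relative_advantage(torch.tensor(rewards))  # \hat{A}^i
```

Because the advantage is computed within each group of $G$ rollouts, rollouts whose sampled expert combinations yield above-average rewards are reinforced relative to their peers, which is what pushes the router toward more diverse, higher-reward expert selections than deterministic top-$K$ routing.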