MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang

TL;DR

This work tackles reward sparsity in RLVR by using heterogeneous multi-expert prompts and inter-expert mutual learning. It introduces MEML-GRPO, a two-stage framework that combines Multi-Expert Fine-Tuning (MEF) and Reinforced Inter-Expert Learning (RIEL), including KL-divergence-based inter-expert transfer and a hard-example SFT buffer to sustain progress on difficult tasks. On GSM8K, MathQA, and StrategyQA, MEML-GRPO delivers consistent gains over state-of-the-art RLVR methods for both Qwen and Llama models, with average improvements of 4.89% and 11.33% respectively, while inference stays efficient because only a single model is deployed. The findings indicate that diverse expert prompts coupled with cross-expert learning can mitigate reward sparsity and strengthen reasoning in large language models.

Abstract

Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model's performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.
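
The reward-sparsity problem described above can be made concrete with a minimal sketch of GRPO-style group-relative advantages. This is an illustration rather than the authors' implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for this example. It shows that when every sampled candidate receives a reward of zero, all advantages collapse to zero and no learning signal remains, whereas a single correct candidate restores a usable signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style sketch).

    `eps` and the population standard deviation are assumptions of this
    example, not details taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every candidate answer is wrong: all advantages are zero,
# so the policy gradient carries no learning signal.
sparse = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One candidate is correct: it gets a positive advantage and the
# incorrect candidates get negative ones.
mixed = group_relative_advantages([0.0, 0.0, 0.0, 1.0])

print(sparse)
print(mixed)
```

Sampling with several different expert prompts, as MEML-GRPO proposes, is aimed at making the second situation (at least one correct candidate in the group) more likely on hard problems.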

Paper Structure

This paper contains 25 sections, 11 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: This figure illustrates the pipeline of MEML-GRPO. The GRPO loss, which is computed across all experts, is omitted from the figure for brevity.
  • Figure 2: Training reward dynamics of MEML-GRPO (Llama3.2) compared with other on-policy RL methods.