DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling

Shanghaoran Quan

TL;DR

DMoERM addresses two core RM challenges in RLHF: multi-task disturbance across diverse data and noise from imperfect human annotations. It introduces a double-layer MoE architecture with a sparse outer router that directs inputs to task-specific inner MoEs, where LoRA-fine-tuned capability-point experts are individually trained and then aggregated by an MLP. Capability-point labels are obtained using a public LLM API to reduce annotation cost while maintaining performance, and the approach demonstrates superior consistency with human preferences and reduced overoptimization compared with state-of-the-art RM ensembling. The work provides extensive experiments across tasks, model sizes, and optimization regimes (BoN and PPO), along with data/code availability to support replication and further research.

Abstract

The performance of the reward model (RM) is a critical factor in improving the effectiveness of the large language model (LLM) during alignment fine-tuning. Two challenges remain in RM training: 1) training the same RM on various categories of data may cause its generalization performance to suffer from multi-task disturbance, and 2) the human annotation consistency rate is generally only $60\%$ to $75\%$, so the training data contains substantial noise. To tackle these two challenges, we introduce the idea of Mixture-of-Experts (MoE) into the field of RM for the first time. We propose the Double-Layer MoE RM (DMoERM). The outer layer MoE is a sparse model. After classifying an input into task categories, we route it to the corresponding inner layer task-specific model. The inner layer MoE is a dense model. We decompose the specific task into multiple capability dimensions and individually fine-tune a LoRA expert on each one. Their outputs are then synthesized by an MLP to compute the final rewards. To minimize costs, we call a public LLM API to obtain the capability preference labels. Validation on manually labeled datasets confirms that our model attains superior consistency with human preference and outstrips advanced generative approaches. Meanwhile, through BoN sampling and RL experiments, we demonstrate that our model outperforms state-of-the-art RM ensemble methods and mitigates the overoptimization problem. Our code and dataset are available at: https://github.com/quanshr/DMoERM-v1.
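
The two-layer routing described in the abstract can be illustrated with a minimal sketch. Everything here is hypothetical and is not taken from the DMoERM repository: the names `InnerMoE`, `DoubleLayerRM`, `toy_classifier`, `length_expert`, and `overlap_expert` are invented for illustration, the experts are toy heuristics standing in for LoRA-fine-tuned models, each expert emits a single scalar as a simplification, and the MLP weights are random rather than trained.

```python
import math
import random
from typing import Callable, Dict, Sequence

# A capability expert maps (prompt, response) to a scalar score.
# In the paper these are LoRA-fine-tuned models; here they are toy heuristics.
Expert = Callable[[str, str], float]


class InnerMoE:
    """Dense inner layer: every expert scores the input, then a small MLP merges the scores."""

    def __init__(self, experts: Sequence[Expert], hidden: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.experts = list(experts)
        k = len(self.experts)
        # Untrained, randomly initialised one-hidden-layer MLP (assumption for illustration).
        self.w1 = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
        self.b2 = 0.0

    def reward(self, prompt: str, response: str) -> float:
        # Dense: all capability experts are evaluated for every input.
        scores = [expert(prompt, response) for expert in self.experts]
        hidden = [
            math.tanh(sum(w * s for w, s in zip(row, scores)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


class DoubleLayerRM:
    """Sparse outer layer: classify the task, then call only the matching inner MoE."""

    def __init__(self, classify: Callable[[str], str], inner_by_task: Dict[str, InnerMoE]):
        self.classify = classify
        self.inner_by_task = inner_by_task

    def reward(self, prompt: str, response: str) -> float:
        task = self.classify(prompt)
        return self.inner_by_task[task].reward(prompt, response)


def toy_classifier(prompt: str) -> str:
    # Hypothetical keyword router; a real router would be a learned classifier.
    return "roleplay" if "pretend" in prompt.lower() else "chat"


def length_expert(prompt: str, response: str) -> float:
    return min(len(response.split()) / 50.0, 1.0)


def overlap_expert(prompt: str, response: str) -> float:
    p, r = set(prompt.lower().split()), set(response.lower().split())
    return len(p & r) / max(len(p), 1)


rm = DoubleLayerRM(
    toy_classifier,
    {
        "roleplay": InnerMoE([length_expert, overlap_expert], seed=1),
        "chat": InnerMoE([length_expert], seed=2),
    },
)
print(rm.reward("Pretend you are a pirate", "Arr, I am a pirate"))
```

The outer layer is sparse because only one inner model runs per input, while the inner layer is dense because all of its capability experts contribute to the final reward.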

Paper Structure (39 sections, 11 equations, 8 figures, 14 tables)

Figures (8)

  • Figure 1: The outer MoE routes inputs to corresponding task-specific inner MoE.
  • Figure 2: The results of consistency study.
  • Figure 3: The training framework of each inner layer MoE. The LoRA components in the figure are only for illustration; in the actual experiments, LoRA layers are injected into each layer of the transformer. Training details are given in the paper's training-details section.
  • Figure 4: The progress of the model at different training stages. The horizontal axis of each image represents the number of training steps, and the vertical axis represents the accuracy of ranking pairs of responses on the training and test sets. Panel sub1 shows the results of training Phase 1, panels sub2.1 through sub2.6 show the results of training Phase 2, and panel sub3 (top-right) shows the results of training Phase 3.
  • Figure 5: The optimization results for BoN and PPO for the roleplay task. The x-axes have a square-root scale, and the KL divergence scale differs between BoN and PPO due to differences in the algorithm and the KL calculation. All RMs are normalized to have zero mean after training.
  • ...and 3 more figures
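
The Figure 5 caption mentions best-of-n (BoN) sampling, a KL axis, and zero-mean reward normalization without giving formulas. The sketch below shows one common way these pieces are computed. It assumes the analytic BoN expression KL = log n − (n − 1)/n that is widely used in the overoptimization literature; whether the paper uses exactly this expression is an assumption, and the helper names `bon_kl`, `zero_mean`, and `best_of_n` are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def bon_kl(n: int) -> float:
    """Analytic KL (in nats) between the best-of-n policy and the base policy.

    Assumed formula: log(n) - (n - 1) / n.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.log(n) - (n - 1) / n


def zero_mean(rewards: Sequence[float]) -> List[float]:
    """Shift rewards so that their mean is zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward under reward_fn."""
    return max(candidates, key=reward_fn)


# Toy usage: string length stands in for a reward model.
samples = ["ok", "a longer reply", "mid reply"]
print(best_of_n(samples, len))
print([round(bon_kl(n), 3) for n in (1, 4, 16)])
```

Under this expression the KL grows only logarithmically in n, which is one reason BoN and PPO curves are usually plotted on different KL scales.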