Upcycled and Merged MoE Reward Model for Mitigating Reward Hacking

Lingling Fu

TL;DR

Reward models in RLHF are vulnerable to reward hacking, especially at small scales. The authors propose an upcycle-and-merge MoE approach: a dense reward model is upcycled into MoE form with a shared expert for general knowledge, routing-weight normalization is applied, and the experts are then merged back into a dense model through a learnable weight-averaging mechanism controlled by a shared-expert rate. On AlpacaFarm and across multiple base models, the method reduces reward hacking and outperforms both dense baselines and ensemble approaches while lowering compute demands.

Abstract

Reward models play a critical role in Reinforcement Learning from Human Feedback (RLHF) by assessing the consistency between generated outputs and human preferences. However, conventional reward models are prone to reward hacking, or over-optimization, where the policy exploits shortcut patterns to obtain high reward scores that do not reflect true human preference. Although Mixture-of-Experts (MoE)-based reward models can enhance discriminative capability, they typically introduce substantial computational overhead. To address these challenges, we propose an upcycle-and-merge MoE reward modeling approach. We first upcycle a dense reward model into an MoE architecture, where a shared expert captures general knowledge, while normal experts specialize in instruction-specific patterns. We then apply routing-weight normalization and merge the experts back into a dense model through a learnable weight-averaging mechanism, preserving performance gains while significantly reducing inference cost. Experimental results demonstrate that our method effectively mitigates reward hacking across various model scales. Our work highlights the potential of upcycle-and-merge MoE structures for improving both the robustness and the efficiency of RLHF reward models.
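The upcycle, normalize, and merge steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each expert's parameters are flattened into a list of floats, that the averaging coefficients over normal experts come from a softmax over logits (learned in practice, passed in as fixed values here), and that the shared-expert rate linearly interpolates between the shared expert and the averaged normal experts. The function names (upcycle, normalize_routing, merge) and the top-k form of routing-weight normalization are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def upcycle(dense_params, num_experts):
    """Initialize one shared expert and `num_experts` normal experts
    as copies of the dense parameters (hypothetical helper)."""
    shared = list(dense_params)
    experts = [list(dense_params) for _ in range(num_experts)]
    return shared, experts


def normalize_routing(router_logits, top_k):
    """Keep the top-k routing probabilities and rescale them to sum to 1.
    Assumption: normalization is applied over the selected experts only."""
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:top_k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0
            for i in range(len(probs))]


def merge(shared, experts, merge_logits, shared_rate):
    """Average expert parameters back into one dense parameter list.
    Assumed form: shared_rate * shared + (1 - shared_rate) * weighted
    average of the normal experts, with softmax(merge_logits) as weights."""
    coeffs = softmax(merge_logits)
    merged = []
    for j in range(len(shared)):
        normal_avg = sum(c * e[j] for c, e in zip(coeffs, experts))
        merged.append(shared_rate * shared[j]
                      + (1.0 - shared_rate) * normal_avg)
    return merged
```

Under these assumptions, merging immediately after upcycling returns the original dense parameters, since every expert is still an identical copy. The merged model has the same parameter count as the dense model, which is consistent with the stated goal of keeping inference cost low.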
