Scalable Ensembling For Mitigating Reward Overoptimisation

Ahmed M. Ahmed; Rafael Rafailov; Stepan Sharkov; Xuechen Li; Sanmi Koyejo

TL;DR

The paper tackles reward overoptimisation in RLHF for large language models by proposing a scalable ensemble method that uses a shared encoder with multiple linear reward heads and a minimum operator to create a pessimistic yet efficient reward signal. Through PPO-based RLHF experiments on the AlpacaFarm dataset, the proposed multi-head reward model achieves comparable performance to a full ensemble while substantially reducing memory and training time. The findings include favorable calibration properties and guidance that a small number of heads (around three) may be optimal. This approach enables more scalable and robust alignment of instruction-following models with human preferences, without the prohibitive costs of full ensemble methods.
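As a rough illustration of the idea summarised above, the following sketch encodes an input once, scores it with several linear heads, and takes the minimum as the reward. It is a toy under stated assumptions, not the authors' implementation: the one-layer `encode` function stands in for a language-model encoder, and the dimensions, random weights, and function names are all hypothetical.

```python
import math
import random


def encode(features, enc_weights):
    # Toy stand-in for the shared encoder: a single tanh layer (assumption).
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in enc_weights]


def head_rewards(hidden, heads):
    # Each head is an independent linear map from the shared hidden state to a scalar.
    return [sum(w * h for w, h in zip(head, hidden)) for head in heads]


def pessimistic_reward(features, enc_weights, heads):
    # Encode once, score with every head, and keep the lowest (most pessimistic) score.
    hidden = encode(features, enc_weights)
    return min(head_rewards(hidden, heads))


rng = random.Random(0)
input_dim, hidden_dim, num_heads = 4, 8, 3  # hypothetical toy sizes
enc_weights = [[rng.gauss(0, 1) for _ in range(input_dim)] for _ in range(hidden_dim)]
heads = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_heads)]

x = [0.5, -1.0, 0.25, 2.0]  # hypothetical input features
scores = head_rewards(encode(x, enc_weights), heads)
reward = pessimistic_reward(x, enc_weights, heads)
print(scores, reward)
```

Because the encoder runs only once per input, adding a head costs one extra dot product rather than an extra forward pass through a full model.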

Abstract

Reinforcement Learning from Human Feedback (RLHF) has enabled significant advances in language modeling, producing powerful instruction-following models. However, aligning these models remains a pressing challenge: the policy tends to overfit the learned "proxy" reward model, so that past an inflection point its utility, as measured by a more performant "gold" reward model, degrades. This phenomenon is known as overoptimisation. Prior work has mitigated the issue by computing a pessimistic statistic over an ensemble of reward models, a technique common in offline reinforcement learning but very costly for language models with high memory requirements, which makes such approaches infeasible for sufficiently large models. To this end, we propose using a shared encoder with separate linear heads. We find that this yields performance similar to the full ensemble while substantially reducing the memory and time required for training models of similar size.
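To make the memory argument concrete, here is a back-of-the-envelope parameter count comparing a full ensemble of k reward models with one shared encoder carrying k linear heads. The encoder size and hidden dimension are hypothetical placeholders rather than figures from the paper, and each head is assumed to be a bias-free linear map to a scalar.

```python
def full_ensemble_params(encoder_params, hidden_dim, k):
    # k independent reward models, each with its own encoder and one linear head.
    return k * (encoder_params + hidden_dim)


def multi_head_params(encoder_params, hidden_dim, k):
    # One shared encoder plus k linear heads.
    return encoder_params + k * hidden_dim


encoder_params = 1_000_000_000  # hypothetical encoder size
hidden_dim = 4096               # hypothetical hidden dimension
k = 3                           # number of ensemble members / heads

full = full_ensemble_params(encoder_params, hidden_dim, k)
multi = multi_head_params(encoder_params, hidden_dim, k)
print(full, multi, full / multi)
```

Under these assumptions the full ensemble holds roughly k times as many parameters as the multi-head model, since the heads are negligible next to the encoder. The count covers reward-model weights only and ignores optimizer state and activations.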

Paper Structure (19 sections, 3 equations, 6 figures, 4 tables)

Introduction
Related Work
Background & Methods
PPO
Reward Learning
Multi-head Reward Learning
Experiments
Datasets and Motivation
Methodology and Tools
Results
Calibration
Calibration Analysis and Implications
Conclusion
Acknowledgements
Appendix
...and 4 more sections

Figures (6)

Figure 1: Comparison of Reward Modeling methods to ours (right)
Figure 2: Gold analysis on top, Proxy metrics below.
Figure 3: Multi-head model calibration over different objectives with respect to the probabilities (taking min, max over ensemble etc.)
Figure 4: Gold analysis.
Figure 5: Proxy metrics.
...and 1 more figure
