Adversarial Training of Reward Models

Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Zhilin Wang, Oleksii Kuchaiev, Olivier Delalleau, Tuo Zhao

TL;DR

Reward models used for RLHF can become misaligned because they generalize poorly out of distribution, which enables reward hacking. Adv-RM trains an adversarial policy with reinforcement learning to construct high-reward, out-of-distribution responses, then adds those responses to RM training to improve robustness. Across synthetic and real RLHF benchmarks, Adv-RM achieves high attack success on state-of-the-art RMs and yields more stable training with better downstream alignment; two rounds of adversarial training prove most effective. The approach advances scalable alignment by mitigating reward hacking, at the price of increased computational cost and a reliance on ensemble-based OOD signals.

Abstract

Reward modeling has emerged as a promising approach for the scalable alignment of language models. However, contemporary reward models (RMs) often lack robustness, awarding high rewards to low-quality, out-of-distribution (OOD) samples. This can lead to reward hacking, where policies exploit unintended shortcuts to maximize rewards, undermining alignment. To address this challenge, we introduce Adv-RM, a novel adversarial training framework that automatically identifies adversarial examples: responses that receive high rewards from the target RM but are OOD and of low quality. By leveraging reinforcement learning, Adv-RM trains a policy to generate adversarial examples that reliably expose vulnerabilities in large state-of-the-art reward models such as Nemotron 340B RM. Incorporating these adversarial examples into the reward training process improves the robustness of RMs, mitigating reward hacking and enhancing downstream performance in RLHF. We demonstrate that Adv-RM significantly outperforms conventional RM training, increasing stability and enabling more effective RLHF training in both synthetic and real-data settings.
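
The abstract's definition of an adversarial example (high reward from the target RM, yet OOD) can be illustrated with a small filter. This is a minimal sketch under stated assumptions, not the paper's method: Adv-RM trains a policy with reinforcement learning, whereas the code below only shows a selection criterion. It assumes that disagreement between the target RM and a second, independently trained RM serves as the OOD signal; the names `ScoredResponse`, `select_adversarial`, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredResponse:
    text: str
    target_reward: float     # score from the RM under attack
    reference_reward: float  # score from a second RM (assumed OOD signal)


def select_adversarial(
    candidates: List[ScoredResponse],
    min_target_reward: float,
    min_disagreement: float,
) -> List[ScoredResponse]:
    """Keep responses the target RM rates highly but the reference RM does not.

    Assumption: a large gap between the two scores indicates the response
    is out of distribution for the target RM.
    """
    selected = []
    for c in candidates:
        disagreement = c.target_reward - c.reference_reward
        if c.target_reward >= min_target_reward and disagreement >= min_disagreement:
            selected.append(c)
    return selected
```

Responses that pass such a filter would then be candidates for the rejected side of new preference pairs when retraining the RM, which is the augmentation step the abstract describes.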

Paper Structure

This paper contains 22 sections, 8 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: (a) and (b) show $U_{\theta_1, \theta_2}(\cdot)$ versus gold score. (c) shows that $R_{\theta_1}$ is similar for the Adv-RM and SFT data, which helps isolate the relationship between $U_{\theta_1, \theta_2}(\cdot)$ and gold score.
  • Figure 2: Adversarial examples generated by Adv-RM for top RewardBench models. The Z-score is computed by normalizing the reward score by the average reward achieved by Llama-3.1-8b-Instruct for that prompt.
  • Figure 3: Attack transferability.
  • Figure 4: Downstream policy results in the synthetic setup. Error bars represent $\pm$ one standard deviation over three random seeds.
  • Figure 5: Downstream policy results with different judge models.
  • ...and 2 more figures
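
The per-prompt normalization mentioned in the Figure 2 caption can be sketched as follows. The caption only states that the reward is normalized by the average reward Llama-3.1-8b-Instruct achieves on the same prompt; dividing by the sample standard deviation of those baseline rewards is an assumption made here to obtain a conventional Z-score, and `reward_z_score` is a hypothetical helper name.

```python
from statistics import mean, stdev
from typing import Sequence


def reward_z_score(reward: float, baseline_rewards: Sequence[float]) -> float:
    """Standardize a reward against baseline rewards for the same prompt.

    baseline_rewards: rewards the same RM assigns to several baseline-model
    responses to this prompt (at least two are needed for a sample std).
    """
    mu = mean(baseline_rewards)
    sigma = stdev(baseline_rewards)  # assumption: sample standard deviation
    if sigma == 0:
        raise ValueError("baseline rewards have zero variance")
    return (reward - mu) / sigma
```

Under this reading, a positive Z-score means the RM rates the adversarial response above the typical baseline response for that prompt.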