Rethinking the Role of Proxy Rewards in Language Model Alignment

Sungdong Kim, Minjoon Seo

TL;DR

This paper aims to replicate the ground truth (gold) reward signal by achieving a monotonic relationship between the proxy and gold reward signals after training the model using the proxy reward in reinforcement learning (RL).

Abstract

Learning from human feedback via proxy reward modeling has been studied as a way to align Large Language Models (LLMs) with human values. However, achieving reliable training through a proxy reward model (RM) is not trivial, and the RM's behavior has remained a black box. In this paper, we study the role of proxy rewards in LLM alignment via 'reverse reward engineering': composing interpretable features into a white-box reward function. We aim to replicate the ground truth (gold) reward signal by achieving a monotonic relationship between the proxy and gold reward signals after training the model with the proxy reward in reinforcement learning (RL). Our findings indicate that successfully emulating the gold reward requires generating responses that are relevant and sufficiently long for open-ended questions, while also ensuring response consistency for closed-ended questions. Furthermore, the resulting models optimized with our white-box reward show performance competitive with strong open-source RMs on alignment benchmarks. We highlight its potential as a simple but strong reward baseline for LLM alignment that requires neither an explicit human feedback dataset nor RM training. Our code is available at https://github.com/naver-ai/rethinking-proxy-reward.
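To make the idea of a white-box reward concrete, the following is a minimal sketch, not the authors' implementation. It multiplies three interpretable features (a length incentive, a repetition penalty, and a relevance score) for open-ended queries and switches to a different product for closed-ended ones. All function names, the saturating length form, the Jaccard-overlap relevance, and the exact closed-ended branch are assumptions made for illustration; the released code at the URL above is the authoritative reference.

```python
def length_incentive(response: str, target_tokens: int = 50) -> float:
    """Grow linearly with response length and saturate at 1.0 (hypothetical form)."""
    return min(len(response.split()) / target_tokens, 1.0)


def repetition_penalty(response: str, n: int = 2) -> float:
    """Fraction of unique n-grams; 1.0 means no n-gram is repeated."""
    tokens = response.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0
    return len(set(ngrams)) / len(ngrams)


def relevance(text_a: str, text_b: str) -> float:
    """Jaccard overlap of lowercased word sets.

    A crude stand-in (assumption) for a learned relevance scorer.
    """
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def white_box_reward(query, response, open_ended, reference=None, target_tokens=50):
    """Branch the reward on query type (illustrative, assumed design)."""
    rp = repetition_penalty(response)
    if open_ended:
        # Open-ended: reward responses that are long enough and relevant to the query.
        return length_incentive(response, target_tokens) * rp * relevance(query, response)
    # Closed-ended (assumed branch): drop the length incentive and reward
    # agreement with a reference answer to encourage consistent responses.
    if reference is None:
        raise ValueError("closed-ended queries need a reference answer in this sketch")
    return rp * relevance(reference, response)


if __name__ == "__main__":
    print(white_box_reward("what is rl", "rl is reinforcement learning", open_ended=True))
    print(white_box_reward("2+2?", "4", open_ended=False, reference="4"))
```

Multiplying the features means a response must satisfy all of them at once: a long but irrelevant or highly repetitive answer still receives a low reward in this sketch.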

Paper Structure

This paper contains 37 sections, 2 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: A preview of our reverse reward engineering experiment. First, we design white-box reward functions with interpretable features such as the length or relevance of the response. Then, we conduct RL training using each designed function as a proxy reward, and deem the reverse engineering a success if a monotonic relationship between the proxy and ground-truth (Gold) reward scores is observed across multiple evaluations. The reverse-engineered reward (blue) exhibits this tendency, whereas the length-only reward (green) does not, showing reward overoptimization [gao2023scaling].
  • Figure 2: An overview of the reverse reward engineering study. It aims to imitate the ground-truth reward signal of the Gold RM with white-box reward features such as length, repetition, and relevance. Specifically, we look for a monotonic relationship between the proxy and gold reward signals across multiple evaluations during RL training. The interpretable features used in the study help us understand the role of the Gold RM.
  • Figure 3: Results of reverse reward engineering. We visualize normalized proxy and gold reward scores every 500 PPO steps for each reward design option. The upper row shows results from Anthropic-HH [bai2022training], and the lower row shows results from AlpacaFarm [dubois2023alpacafarm]. Instances from AlpacaEval [alpaca_eval] are used to compute the reward scores. Reverse engineering succeeds when the proxy and gold reward scores have a monotonic relationship. We find that considering relevance and adopting different rewards according to query type, i.e., $\texttt{RER}$, contribute to increasing the gold reward reliably.
  • Figure 4: Gold reward scores according to whether the query type requires open-ended (OE) or closed-ended (CE) responses. We compare two proxy reward options, $\texttt{LI} \cdot \texttt{RP} \cdot \texttt{QR}$ and $\texttt{RER}$, based on models trained with Anthropic-HH [bai2022training]. The two options differ meaningfully on the CE type, demonstrating the importance of reward branching.
  • Figure 5: Self-BLEU of the PPO models on AlpacaEval according to query type. Note that a higher Self-BLEU indicates lower response diversity, i.e., more consistent responses, and vice versa. Following [touvron2023llama2], we aim for high Self-BLEU on the CE type and low Self-BLEU on the OE type.
  • ...and 10 more figures
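
The success criterion described in the captions for Figures 1 to 3 is a monotonic relationship between proxy and gold reward scores across checkpoints. One simple way to quantify this is Spearman rank correlation. This is a minimal sketch under assumptions: the paper's summary here does not say which statistic, if any, the authors use, and the checkpoint scores below are made-up illustrative numbers, not results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return 0.0  # a constant sequence carries no ordering information
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical normalized scores at successive checkpoints (illustrative only).
    proxy = [0.0, 0.2, 0.5, 0.8, 1.0]
    gold_monotonic = [0.1, 0.3, 0.4, 0.7, 0.9]  # gold rises with the proxy
    gold_overopt = [0.1, 0.5, 0.6, 0.3, 0.0]    # gold peaks, then collapses
    print(spearman(proxy, gold_monotonic))
    print(spearman(proxy, gold_overopt))
```

A coefficient close to 1 corresponds to the monotonic pattern the captions describe for the reverse-engineered reward, while a low or negative value matches the overoptimization pattern, where the proxy keeps rising as the gold score falls.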