RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, Bruno Castro da Silva

TL;DR

RLHF aims to align large language models with human preferences by learning a reward model from human feedback and optimizing the model with reinforcement learning. The paper provides a principled, component-focused analysis of RLHF, emphasizing the reward model, its assumptions, and the implications of reward misspecification, data sparsity, and delayed feedback. It offers a comprehensive survey of RLHF literature, discusses practical training challenges (stability, hyperparameter sensitivity, and iteration), and examines alternatives that reduce reliance on reward models. The analysis highlights the concept of an oracular reward, the limitations of current reward modeling, and the need for uncertainty quantification and robust evaluation to ensure safer and more reliable aligned language behavior.

Abstract

State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of a language model. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.

Paper Structure (63 sections, 25 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of the RLHF procedure, illustrating the challenges encountered at each step. The paper conducts a detailed examination of these challenges, providing valuable insights into each stage of the procedure.
  • Figure 2: Text generation from LLMs modeled as a Markov decision process. Generation is auto-regressive: at each timestep the language model (policy) takes the context so far (state) and produces the next token (action). Given a context $c$, the language model produces the token $o_1$ at the first timestep. The concatenation $[c, o_1]$ then forms the input to the policy at the next timestep (Table \ref{tbl:formulation}). A reward function scores the generated output for a given context.
  • Figure 3: The reward model tends to misgeneralize for inputs not found in its training data, i.e., for $(c, o) \notin \mathcal{D}_{\text{rew}}$. This occurs in two ways: 1) when the context is not sampled by the prompting distribution used to generate outputs and collect feedback (represented by $\kappa$), and 2) when the support of the output-generating distribution (the language model) for a context does not span all possible outputs (represented by $\rho$). The latter is depicted in this figure.
  • Figure 4: Categorization of the components of RLHF, with representative example works from the literature.
  • Figure 5: Workflow of RLHF. All RLHF workflows for training language models begin with a pretraining phase, optionally followed by supervised finetuning (SFT) on human demonstrations. An iterative loop follows: collecting human feedback on model-generated outputs, training a reward model, and updating the language model using a suitable RL algorithm.
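
The Markov decision process view in the Figure 2 caption can be illustrated with a small sketch. The code below is not taken from the paper: `rollout`, `toy_policy`, `toy_reward`, and the `<eos>` stop token are hypothetical names chosen for illustration, and the policy and reward are hard-coded toys rather than learned models. It shows only the structure the caption describes: the state is the context plus the tokens generated so far, the action is the next token, the transition is concatenation, and the reward function scores the complete output for a given context.

```python
from typing import Callable, List, Sequence, Tuple

Token = str
State = Tuple[Token, ...]  # context c followed by the tokens o_1..o_t so far


def rollout(
    policy: Callable[[State], Token],
    context: Sequence[Token],
    reward_fn: Callable[[State, State], float],
    max_steps: int,
    eos: Token = "<eos>",  # hypothetical stop token, an assumption of this sketch
) -> Tuple[List[Token], float]:
    """Generate tokens auto-regressively, then score the finished output."""
    state: State = tuple(context)
    output: List[Token] = []
    for _ in range(max_steps):
        action = policy(state)      # the policy maps the state to the next token
        output.append(action)
        state = state + (action,)   # transition: next state is [state, action]
        if action == eos:
            break
    # The reward is computed once, on the whole output for the given context.
    return output, reward_fn(tuple(context), tuple(output))


# Toy stand-ins for a language model and a reward model (illustrative only).
NEXT_TOKEN = {"hi": "hello", "hello": "world", "world": "<eos>"}


def toy_policy(state: State) -> Token:
    # Deterministic: look only at the most recent token in the state.
    return NEXT_TOKEN.get(state[-1], "<eos>")


def toy_reward(context: State, output: State) -> float:
    # Arbitrary scoring rule: 1.0 if the output mentions "world", else 0.0.
    return 1.0 if "world" in output else 0.0


tokens, score = rollout(toy_policy, ["say", "hi"], toy_reward, max_steps=10)
print(tokens, score)
```

In this sketch the reward arrives only after generation stops, so intermediate tokens receive no direct signal; a real language model policy would also sample from a distribution over a vocabulary rather than follow a lookup table.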