Reinforcement Learning Enhanced LLMs: A Survey

Shuhe Wang, Shengyu Zhang, Jie Zhang, Runyi Hu, Xiaoya Li, Tianwei Zhang, Jiwei Li, Fei Wu, Guoyin Wang, Eduard Hovy

TL;DR

The paper addresses the fragmentation in RL-enhanced LLM research by systematically surveying RL foundations, popular RL-enhanced LLMs, RLHF and RLAIF approaches, and Direct Preference Optimization (DPO), which optimizes directly on human preference data. It synthesizes how reward models are built, how human or AI feedback is integrated, and how direct preference methods bypass reward modeling to stabilize and accelerate alignment. Key contributions include a taxonomy of models and methods, an analysis of challenges (out-of-distribution (OOD) generalization, interpretability, safety, evaluation), and a detailed examination of DPO variants and their safety considerations. The work highlights practical implications for researchers and practitioners seeking scalable, reliable alignment of LLMs with human preferences and safety constraints. Overall, the survey clarifies current state-of-the-art techniques and points to promising directions for more robust, efficient RL-based alignment in real-world LLM deployments.

Abstract

Reinforcement learning (RL) enhanced large language models (LLMs), exemplified by DeepSeek-R1, have exhibited outstanding performance. Although RL is effective at improving LLM capabilities, its implementation remains highly complex, requiring sophisticated algorithms, reward modeling strategies, and optimization techniques. This complexity makes it difficult for researchers and practitioners to develop a systematic understanding of RL-enhanced LLMs. Moreover, the absence of a comprehensive survey summarizing existing research on RL-enhanced LLMs has limited progress in this domain. In this work, we systematically review the most up-to-date state of knowledge on RL-enhanced LLMs, consolidating and analyzing the rapidly growing research in this field to help researchers understand current challenges and advancements. Specifically, we (1) detail the basics of RL; (2) introduce popular RL-enhanced LLMs; (3) review research on two widely used reward model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF); and (4) explore Direct Preference Optimization (DPO), a set of methods that bypass the reward model and directly use human preference data to align LLM outputs with human expectations. We also point out current challenges and deficiencies of existing methods and suggest avenues for further improvement. The project page for this work can be found at https://github.com/ShuheWang1998/Reinforcement-Learning-Enhanced-LLMs-A-Survey.

Paper Structure

This paper contains 90 sections, 1 equation, 29 figures, 1 table.

Figures (29)

  • Figure 1: An example of the full process of RL. Training Objective: The goal is to train a robot to navigate from the bottom-left corner of a square to the top-right corner. Each grid cell is assigned a reward score, and the objective is to maximize the robot’s overall score. General Pipeline of RL: The agent begins in an initial state $s_0$, and at each time step $t$, it selects an action $a_{t}$ based on its current state $s_{t}$. In response, the environment transitions to a new state $s_{t+1}$, and the agent receives a reward $r_{t}$.
  • Figure 2: The framework of RL for LLMs proposed by ouyang2022training.
  • Figure 3: The composition of the Skywork-Reward. The figure is copied from liu2024skywork.
  • Figure 4: Magpie self-synthesizes data from aligned LLMs. The figure is borrowed from xu2024magpie.
  • Figure 5: Identified bias types and examples in OffsetBias. The figure is borrowed from park2024offsetbias.
  • ...and 24 more figures
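
The agent-environment loop described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the grid size, the per-cell reward values, the greedy `policy`, and all function names are assumptions introduced here, since the caption does not specify them.

```python
# Minimal grid-world sketch of the RL loop from the Figure 1 caption.
# All constants and names below are hypothetical, chosen for illustration.

GRID_SIZE = 4                                  # assumed; the caption fixes no size
START = (0, 0)                                 # bottom-left corner
GOAL = (GRID_SIZE - 1, GRID_SIZE - 1)          # top-right corner
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def reward(state):
    """Assumed per-cell reward: -1 for every ordinary cell, +10 at the goal."""
    return 10.0 if state == GOAL else -1.0


def step(state, action):
    """Environment transition: returns (s_{t+1}, r_t), clamping at the walls."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), GRID_SIZE - 1)
    y = min(max(state[1] + dy, 0), GRID_SIZE - 1)
    next_state = (x, y)
    return next_state, reward(next_state)


def policy(state):
    """Placeholder hand-written policy: go right until the edge, then up.

    A learned policy would replace this function; it only shows that the
    action a_t is chosen from the current state s_t.
    """
    return "right" if state[0] < GOAL[0] else "up"


def run_episode(max_steps=100):
    """Roll out one episode and return the visited states and total reward."""
    state = START                              # initial state s_0
    trajectory, total = [state], 0.0
    for _ in range(max_steps):
        action = policy(state)                 # agent selects a_t from s_t
        state, r = step(state, action)         # environment returns s_{t+1}, r_t
        trajectory.append(state)
        total += r
        if state == GOAL:
            break
    return trajectory, total


if __name__ == "__main__":
    states, score = run_episode()
    print(states, score)
```

In a real RL setup the fixed `policy` would be updated from the collected rewards so that the total score increases over episodes; the sketch only makes the state, action, and reward roles concrete.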