MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu

TL;DR

MA-RLHF addresses the credit assignment bottleneck in token-level RLHF by introducing macro actions that operate over token sequences, enabling temporally extended decisions via a semi-Markov framework. The approach, MA-PPO, uses macro-action priors, multiple termination strategies, and an adapted PPO objective to learn over higher-level language constructs without additional computational cost. Across TL;DR, HH-RLHF, WebGPT, and APPS, MA-PPO consistently improves reward-model scores and evaluation win rates, reaches parity with vanilla RLHF 1.7x to 2x faster in training time, and scales effectively from 2B to 27B parameters. The work contributes practical guidance on macro-action termination, ablations over macro-action length, and analysis of generalization and stability.

Abstract

Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to preferred outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions (sequences of tokens or higher-level language constructs) into the learning process. By operating at a higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7 to 2 times faster in terms of training time and continues to outperform it with further training. We make our code and data publicly available at https://github.com/ernie-research/MA-RLHF.
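
To make the idea of "reducing the temporal distance between actions and rewards" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the simplest possible termination rule, a fixed span length `n`, and the function names `segment_fixed_length` and `macro_log_probs` are hypothetical. Each macro action's log-probability is the sum of its tokens' log-probabilities, which follows from the chain rule for autoregressive models.

```python
from typing import List, Tuple


def segment_fixed_length(num_tokens: int, n: int) -> List[Tuple[int, int]]:
    """Split positions 0..num_tokens-1 into consecutive spans of at most n tokens.

    Assumption: a fixed-length termination rule; other termination
    strategies would produce different span boundaries.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return [(start, min(start + n, num_tokens)) for start in range(0, num_tokens, n)]


def macro_log_probs(token_log_probs: List[float],
                    spans: List[Tuple[int, int]]) -> List[float]:
    """Log-probability of each macro action: the sum over the tokens it covers."""
    return [sum(token_log_probs[start:end]) for start, end in spans]


# Illustrative per-token log-probabilities for a 7-token response (made-up values).
token_lp = [-0.5, -0.5, -1.0, -2.0, -0.25, -0.25, -1.0]
spans = segment_fixed_length(len(token_lp), n=3)
macro_lp = macro_log_probs(token_lp, spans)
```

With `n=3`, the 7-token response becomes 3 decision steps instead of 7, so a sequence-level reward has fewer steps to be propagated across. Setting `n=1` recovers the token-level view.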

Paper Structure (44 sections, 5 equations, 25 figures, 12 tables, 1 algorithm)

Figures (25)

  • Figure 1: Illustration of the MA-RLHF optimization framework. Standard RLHF makes decisions and evaluates value scores at the token level, while MA-RLHF makes decisions over sequences of tokens at a coarser temporal scale.
  • Figure 2: Test RM scores of Gemma-2B and Gemma-7B models on the TL;DR dataset. The shaded regions represent the standard deviation on test RM scores across training runs.
  • Figure 3: RM score distribution for PPO and MA-PPO (2B) at final steps (4.6k) on TL;DR.
  • Figure 4: Win rates of MA-PPO against vanilla PPO on TL;DR (left), HH-RLHF (middle) and WebGPT Comparisons (right), estimated by GPT-4 and Human.
  • Figure 5: Performance of MA-PPO with various macro action termination strategies on the TL;DR dataset using Gemma-2B. Left: Test RM scores for different termination strategies. Right: GPT-4 evaluation across four dimensions -- relevance, coherence, consistency, and fluency -- comparing different MA termination methods.
  • ...and 20 more figures