Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback

Mengdi Li, Jiaye Lin, Xufeng Zhao, Wenhao Lu, Peilin Zhao, Stefan Wermter, Di Wang

TL;DR

Curriculum-RLAIF tackles the limited generalization of reward models in RLAIF by introducing a data-centric curriculum that sequences preference data by difficulty. It combines quality-aware sampling (random and guided generations), diverse pair types (random, contrastive, bridging), and a progressive easy-to-hard curriculum with a dedicated reward-learning loss and PPO updates. Empirical results across harmlessness, helpfulness, and summarization tasks show substantial gains in policy alignment and reward generalization, with reduced data-labeling costs compared with non-curriculum baselines. The work shows that exploiting the difficulty structure of preference data improves RLHF/RLAIF alignment without sacrificing efficiency, and it supports this with ablations and visualizations.

Abstract

Reward models trained with conventional Reinforcement Learning from AI Feedback (RLAIF) methods suffer from limited generalizability, which hinders the alignment performance of the policy model during reinforcement learning (RL). This challenge stems from several issues, including distribution shift, preference label noise, and mismatches between overly challenging samples and model capacity. In this paper, we attempt to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined from the perspective of data difficulty. To address this, we propose a novel framework, $\textit{Curriculum-RLAIF}$, which constructs preference pairs with varying difficulty levels and produces a curriculum that progressively incorporates preference pairs of increasing difficulty for reward model training. Our experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, increasing the alignment performance of the policy model by a large margin without incurring additional inference costs compared to various non-curriculum baselines. Detailed analysis and comparisons with alternative approaches, including data selection via external pretrained reward models or internal self-selection mechanisms, as well as other curriculum strategies, further demonstrate the superiority of our approach in terms of simplicity, efficiency, and effectiveness.
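The idea of a curriculum that "progressively incorporates preference pairs of increasing difficulty" can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's implementation: the `PreferencePair` type, the scalar `difficulty` field, the `build_curriculum` helper, and the choice to make stages cumulative (each stage keeps the easier pairs and adds harder ones).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    difficulty: float  # hypothetical score; higher means harder


def build_curriculum(pairs: List[PreferencePair], num_stages: int) -> List[List[PreferencePair]]:
    """Return cumulative training sets ordered from easy to hard.

    Stage k contains the easiest ceil(k / num_stages) fraction of the pairs,
    so later stages add harder pairs on top of the earlier ones.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(pairs, key=lambda p: p.difficulty)
    stages = []
    for k in range(1, num_stages + 1):
        # Integer ceiling of len(ordered) * k / num_stages.
        cutoff = (len(ordered) * k + num_stages - 1) // num_stages
        stages.append(ordered[:cutoff])
    return stages


# Toy data: six pairs with made-up difficulty scores.
toy_pairs = [
    PreferencePair("q", f"good-{i}", f"bad-{i}", difficulty=d)
    for i, d in enumerate([0.9, 0.1, 0.5, 0.3, 0.7, 0.2])
]
toy_stages = build_curriculum(toy_pairs, num_stages=3)
```

In this sketch a reward model would be trained on `toy_stages[0]` first, then on each later stage in turn. Whether the actual method trains on cumulative or disjoint stages is not stated in the abstract.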

Paper Structure

This paper contains 33 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Conceptual illustration of the Curriculum-RLAIF pipeline. (Top) The process begins with quality-aware sampling, combining random and guided strategies to generate responses with varying quality. (Middle) Next, controlled pairing constructs preference pairs exhibiting different difficulty levels based on quality differences. (Bottom) Finally, reward model learning is conducted using a curriculum that presents preference data in order of increasing difficulty (from light to dark gray).
  • Figure 2: Experimental results in the preliminary study: (a) relationship between preference labeling accuracy by a state-of-the-art LLM and confidence score; (b) relationship between reward score accuracy by a reward model obtained from conventional RLAIF and confidence score; (c) consistency between reward distance $\Delta r$ predicted by a pretrained reward model and confidence score.
  • Figure 3: Distribution visualization of the reward distance $\Delta r$ of each curriculum stage.
  • Figure 4: Comparison of reward score accuracy between the conventional RLAIF method (Lee et al., 2024) (in blue) and Curriculum-RLAIF (in orange) across various sample difficulty levels.
  • Figure 5: Distribution visualization of reward distance $\Delta r$ of each curriculum stage in $\mathcal{C}_{\text{brg}}$. The same pretrained large-scale reward model is utilized to calculate the reward distance for both methods.
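
Several of the captions above refer to a reward distance $\Delta r$ computed per curriculum stage by a pretrained reward model. The following Python sketch shows one way such a per-stage summary could be computed. It rests on assumptions not given in this excerpt: $\Delta r$ is taken to be the absolute difference between the scores of the two responses in a pair, and `score`, `reward_distance`, and `mean_distance_per_stage` are hypothetical names, with `score` standing in for whatever reward model is used.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen response, rejected response)
ScoreFn = Callable[[str, str], float]  # hypothetical reward model: (prompt, response) -> score


def reward_distance(score: ScoreFn, pair: Pair) -> float:
    """Assumed definition: absolute gap between the two responses' reward scores."""
    prompt, chosen, rejected = pair
    return abs(score(prompt, chosen) - score(prompt, rejected))


def mean_distance_per_stage(score: ScoreFn, stages: Dict[str, List[Pair]]) -> Dict[str, float]:
    """Summarize each curriculum stage by its mean reward distance."""
    return {
        name: mean(reward_distance(score, pair) for pair in pairs)
        for name, pairs in stages.items()
    }


def toy_score(prompt: str, response: str) -> float:
    # Stand-in scorer for the demo only: longer responses get higher scores.
    return float(len(response))


toy_summary = mean_distance_per_stage(
    toy_score,
    {
        "stage_1": [("q", "aaaaa", "a"), ("q", "aaaa", "a")],
        "stage_2": [("q", "aa", "a")],
    },
)
```

With a real reward model in place of `toy_score`, the same per-stage distances could be plotted as histograms rather than reduced to a mean.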