Reward Modeling with Weak Supervision for Language Models

Ben Hauptvogel, Malte Ostendorff, Georg Rehm, Sebastian Möller

TL;DR

This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance, and shows that using an LLM to generate and then weakly label responses is a promising method for extending preference data.

Abstract

Recent advancements in large language models (LLMs) have led to their increased application across various tasks, with reinforcement learning from human feedback (RLHF) being a crucial part of their training to align responses with user intentions. In the RLHF process, a reward model is trained using response preferences determined by human labelers or AI systems, and is then used to refine the LLM through reinforcement learning. This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance. Weak supervision employs noisy or imprecise data labeling, reducing reliance on expensive manually labeled data. By analyzing RLHF datasets to identify heuristics that correlate with response preference, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data. Our evaluation shows that while weak supervision significantly benefits smaller datasets by improving reward model performance, its effectiveness decreases with larger, originally labeled datasets. Additionally, using an LLM to generate and then weakly label responses offers a promising method for extending preference data.
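
The labeling-function idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two heuristics (`lf_longer`, `lf_no_refusal`), their thresholds, the weights, and the weighted vote in `weak_label` are hypothetical and are not taken from the paper. In particular, the weighted vote is only a simple stand-in for a calibrated label model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (response_a, response_b) for the same prompt
ABSTAIN, PREFER_A, PREFER_B = 0, 1, -1


def lf_longer(pair: Pair) -> int:
    """Hypothetical heuristic: prefer the clearly longer response."""
    a, b = pair
    diff = len(a.split()) - len(b.split())
    if abs(diff) < 5:  # assumed threshold: too close to call
        return ABSTAIN
    return PREFER_A if diff > 0 else PREFER_B


def lf_no_refusal(pair: Pair) -> int:
    """Hypothetical heuristic: prefer the response that does not open with a refusal."""
    markers = ("i cannot", "i can't")
    a, b = pair
    a_refuses = a.lower().startswith(markers)
    b_refuses = b.lower().startswith(markers)
    if a_refuses == b_refuses:
        return ABSTAIN
    return PREFER_B if a_refuses else PREFER_A


def weak_label(
    pair: Pair,
    lfs: List[Callable[[Pair], int]],
    weights: List[float],
) -> Tuple[int, float]:
    """Combine labeling-function votes with a weighted vote.

    Returns (label, confidence). Confidence is the absolute weighted score
    divided by the total weight of the functions that did not abstain.
    """
    votes = [lf(pair) for lf in lfs]
    score = sum(w * v for w, v in zip(weights, votes))
    active = sum(w for w, v in zip(weights, votes) if v != ABSTAIN)
    if active == 0 or score == 0:
        return ABSTAIN, 0.0
    label = PREFER_A if score > 0 else PREFER_B
    return label, abs(score) / active


if __name__ == "__main__":
    pair = (
        "I cannot help with that.",
        "Sure, here is a step by step explanation of how the process works in detail.",
    )
    print(weak_label(pair, [lf_longer, lf_no_refusal], [1.0, 2.0]))
```

Pairs whose confidence falls below a chosen cutoff could be dropped before the weakly labeled data is added to the training set; that cutoff is likewise a design choice of this sketch, not a value reported in the source.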

Paper Structure

This paper contains 21 sections, 1 equation, 6 figures, 21 tables.

Figures (6)

  • Figure 1: Extending RLHF datasets with weak supervision in a three-step pipeline: conducting data analysis, writing labeling functions, applying a label model to create a new weakly labeled dataset.
  • Figure 2: Evaluation for HH-RLHF using 10% (left) and 5% (right) as a baseline train set.
  • Figure 3: Evaluation for HH-RLHF using 2% (left) and 1% (right) as a baseline train set.
  • Figure 4: Evaluation for UB dataset using 10% (upper left), 5% (upper right), and 2% (bottom) as baseline train set.
  • Figure 5: Evaluation for UBP using 10% (upper left), 5% (upper right), and 2% (bottom) as baseline train set.
  • ...and 1 more figure