Aligning Language Models with Human Preferences via a Bayesian Approach

Jiashuo Wang, Haozhao Wang, Shichao Sun, Wenjie Li

TL;DR

The paper aligns language models with human preferences by explicitly modeling annotator disagreement in a Bayesian preference model (d-PM) that approximates a universal preference across individuals. It then calibrates generation with a contrastive-learning objective driven by d-PM scores, avoiding costly RL. On emotional support conversation and integrity Rule-of-Thumb generation, d-PM-based calibration achieves state-of-the-art results in both automatic and human evaluations, along with better global consensus. The approach yields more universally acceptable and less controversial outputs, and its efficient offline calibration suits smaller datasets.

Abstract

In the quest to advance human-centric natural language generation (NLG) systems, ensuring alignment between NLG models and human preferences is crucial. For this alignment, current popular methods leverage a reinforcement learning (RL) approach with a reward model trained on feedback from humans. However, inherent disagreements due to the subjective nature of human preferences pose a significant challenge for training the reward model, resulting in a deterioration of the NLG performance. To tackle this issue, previous approaches typically rely on majority voting or averaging to consolidate multiple inconsistent preferences into a merged one. Although straightforward to understand and execute, such methods cannot capture the nuanced degrees of disagreement among humans and may only represent a specialized subset of individuals, so they cannot quantitatively disclose the universality of human preferences. To address this challenge, this paper proposes a novel approach, named d-PM, which employs a Bayesian framework to account for the distribution of disagreements among human preferences when training a preference model. In addition, because RL training is inefficient and complex, we further propose utilizing a contrastive learning strategy to train the NLG model with the preference scores derived from the d-PM model. Extensive experiments on two human-centric NLG tasks, i.e., emotional support conversation and integrity "Rule-of-Thumb" generation, show that our method consistently exceeds previous SOTA models in both automatic and human evaluations.
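To make the contrast between majority voting and a disagreement-aware estimate concrete, here is a minimal sketch. It is not the d-PM formulation, which this section does not spell out. It assumes binary annotator labels (1 = preferred, 0 = not preferred) and swaps in a simple Beta-Bernoulli posterior mean as a stand-in for "a Bayesian framework". The function names and the uniform prior are illustrative assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Collapse binary annotator labels into one hard label (ties go to 0)."""
    counts = Counter(labels)
    return 1 if counts[1] > counts[0] else 0


def beta_posterior_mean(labels, prior_a=1.0, prior_b=1.0):
    """Posterior mean of P(preferred) under a Beta(prior_a, prior_b) prior.

    Assumption for illustration only: each annotator label is treated as an
    independent Bernoulli draw. The paper's d-PM may model disagreement
    differently.
    """
    positives = sum(labels)
    return (prior_a + positives) / (prior_a + prior_b + len(labels))


unanimous = [1, 1, 1, 1, 1]  # all five annotators prefer the response
contested = [1, 1, 1, 0, 0]  # three of five prefer it

# Majority voting assigns both cases the same hard label ...
hard_labels = (majority_vote(unanimous), majority_vote(contested))

# ... while the posterior mean keeps the degree of disagreement.
soft_scores = (beta_posterior_mean(unanimous), beta_posterior_mean(contested))
```

Under these assumptions, both responses receive the same majority label, but the soft score for the contested response is noticeably lower. That difference is the kind of signal a hard merged label discards.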

Paper Structure (50 sections, 10 equations, 5 figures, 9 tables)


Figures (5)

  • Figure 1: People can have different feelings towards the same response in the emotional support conversation because of their own experiences and values. A trustworthy human-centric system is expected to consider the benefits of universal groups, including minorities, and generate less controversial and more helpful content, like supporter B instead of A.
  • Figure 2: Diagram for preference modeling with disagreement and calibration for alignment.
  • Figure 3: The maximum and minimum preference scores of $10$ candidates generated via diverse beam search given the same context. We test on $1000$ data instances and three emotional support conversation models.
  • Figure 4: Model performances with different candidate numbers $K$ when calibrating MultiESC with preference scores of d-PM.
  • Figure 5: Comparison between alignment with RL (RL) and our model (Ours). Left: Automatic evaluation results (#(Samples)/s indicates the number of trained samples per second). Right: Training loss according to training steps.
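Figures 3 and 4 refer to scoring $K$ candidates per context with the preference model and calibrating the generator on them. The sketch below shows one common way such scores can drive a contrastive objective: a pairwise margin ranking loss over candidates ordered by preference score. This is an assumption for illustration, since the paper's exact loss is not given in this section. The names `ranking_loss`, `model_scores` (for example, length-normalized log-likelihoods from the generator) and `margin` are hypothetical.

```python
def ranking_loss(model_scores, preference_scores, margin=0.1):
    """Pairwise margin ranking loss over K candidates.

    Candidates are ordered by preference score (best first). For each pair,
    the generator's score for the better candidate should exceed its score for
    the worse one by a margin that grows with the rank gap.
    """
    order = sorted(
        range(len(preference_scores)),
        key=lambda idx: preference_scores[idx],
        reverse=True,
    )
    loss = 0.0
    for rank_i, i in enumerate(order):
        for rank_j in range(rank_i + 1, len(order)):
            j = order[rank_j]
            gap = (rank_j - rank_i) * margin
            # Penalize only pairs where the worse candidate is scored too high.
            loss += max(0.0, model_scores[j] - model_scores[i] + gap)
    return loss


preference = [0.9, 0.5, 0.1]      # hypothetical preference-model scores
agreeing = [-0.1, -0.5, -1.0]     # generator already ranks candidates the same way
disagreeing = [-1.0, -0.5, -0.1]  # generator ranks them in reverse

loss_agreeing = ranking_loss(agreeing, preference)
loss_disagreeing = ranking_loss(disagreeing, preference)
```

A loss of this form needs only offline-scored candidates, so no policy rollouts or on-policy reward queries are involved. This is one reason such calibration tends to be simpler to train than RL.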